ABSTRACT

An approach was developed based on item-response 
models defined at the level of salient subject groups rather than at 
the level of individuals, designed for use with multiple-matrix
sampling designs. In each of three National Assessment of Educational 
Progress (NAEP) mathematics subtopics, Reiser's group-effects latent 
trait model was fitted to the proportions of correct response to 
items as observed in the cells of a design. Item parameters and
contrasts among demographic groups were estimated in each of four 
data sets: 1972-73 and 1977-78 data for 13-year-olds and 
17-year-olds. Based on items common to two or more data sets, results 
were linked across ages and over time in each subtopic. Item
parameters and group averages were obtained on scales common across 
ages and years. Successful calibration and linking in all subtopics
demonstrates the feasibility of applying item-response methods to 
sparse sampling designs. However, scaling must be accomplished within 
fairly narrowly-defined skill areas, such as the NAEP subtopics, if 
the integrity of scales is to be maintained. Item response scaling of 
NAEP test booklets as a whole is discouraged. Primary type of 
information provided by the report: Results (Secondary Analysis). 
(Author/CM)

Abstract

Perhaps the two most significant advances in educational
measurement over the past twenty years have been item response
theory and multiple-matrix sampling designs. Unfortunately, few
researchers interested in assessment have been able to enjoy the
full benefits of both advances simultaneously; the current methods
of item response theory cannot deal with the sparse data (at the
level of individuals) that characterize the most efficient
sampling designs. This research develops an approach based on
item-response models defined at the level of salient subject groups
rather than at the level of individuals, designed for use with the
most efficient multiple-matrix designs, i.e., those in which each
sampled subject is presented at most one item per scale.

In each of three NAEP mathematics subtopics, Reiser's
group-effects latent trait model was fit to the proportions of
correct response to items as observed in the cells of a design
including sex, race/ethnicity, region of the country, and size and
type of community. Item parameters and contrasts among demographic
groups were thus estimated in each of four age/year data sets:
1972/73 and 1977/78 data for 13-year-olds and 17-year-olds. (Data
were taken from NAEP public release tapes from these age/years and
from the NAEP mathematics 1972/78 "change" tape.) Based on items
common to two or more age levels and/or assessment years, results
were linked across ages and over time in each subtopic. Item
parameters and group averages were then obtained on scales common
across ages and years, despite the fact that different (but
overlapping) sets of items had been administered in each age/year.

Successful calibration and linking in all three subtopics
demonstrates the feasibility of applying item-response methods to
the sparse sampling designs of modern assessment. It is seen,
however, that scaling must be accomplished within fairly
narrowly-defined skill areas, such as the NAEP subtopics, if the
integrity of scales across demographic groups and over time is to
be maintained. In particular, item response scaling of NAEP test
booklets as a whole is to be most strongly discouraged, as it
virtually guarantees item parameter drift over time and poor fit to
unidimensional item response models.

PREFACE



Item response-curve models have triggered no less than a
revolution in educational measurement. Little wonder, since so
many measurement problems that are difficult or impossible to
solve within the framework of classical psychometric theory become
quite tractable under the item response-curve approach; examples
include analyses of the information that items and tests provide
at various levels of ability, measurement on an invariant scale
from any subset of calibrated items, and simplified
test-equating procedures.

To date these benefits have not been realized in the National
Assessment of Educational Progress. The primary reason, perhaps,
is NAEP's use of multiple-matrix sampling designs, efficient
procedures guaranteed to provide economical estimates of
group-level attainment. Sufficiently precise estimates of group-level
attainment may be obtained by administering only a few items from
a given skill area to any selected subject. Unfortunately, the
current state of item response-curve theory cannot handle data
such as those gathered by NAEP, wherein each subject responds to
too few items in a specific skill area to permit the stable
estimation of ability level.

This project is intended to further the extension of item
response-curve theory to the assessment setting. The foundations
of the present work appeared in the estimation procedures outlined
in Bock (1976) that were later put into practice with the
California Assessment Program (Bock, 1979; Bock and Mislevy, 1981);
item response-curve models are in these applications defined not
at the level of individuals but at the level of salient groups of
individuals. Reiser's (1980) dissertation research introduced a
group-level item response-curve model that is particularly suited
to NAEP data, addressing characteristics of test items and
performances in the cells of a design on persons. The present work
develops procedures to link such results across assessment years
and/or age groups. Examples are drawn from the 1972/73 and
1977/78 NAEP mathematics assessments.



CHAPTER I
INTRODUCTION AND BACKGROUND 



The Nature of Educational Assessment 

The purpose of educational assessment is to provide
information about the levels of skills or attitudes in specified
populations of subjects. Results may be compared from one population
to another or from one point in time to another, in order to study
the effects of educational treatments or societal trends. The
distinguishing feature of assessments, however, is their focus on
groups rather than on individuals.

By virtue of its distinct purposes, assessment requires a
different technology than its close cousin, educational
measurement. The "true-score" models of traditional psychometrics
concern the measurement of individual subjects rather than groups
of subjects; it is not surprising that the strategies of
test construction designed to provide optimal measurement of
individuals are not optimal for assessment. While borrowing
heavily from the models and the concepts of educational
measurement, assessment technology has gainfully employed ideas from
other fields as well, notably those of opinion survey sampling and
sampling design theory.

The state of the art of assessment in the United States is
exemplified by the National Assessment of Educational Progress
(NAEP) and the California Assessment Program (CAP). These two
programs have, since their inceptions over a decade ago, been
proving grounds for measurement and statistical advances designed
to obtain efficient and economical estimates of group-level
attainment. Our attention will focus primarily on the National
Assessment, although discussion of certain topics will be
clarified with examples from the California Assessment.

The National Assessment of Educational Progress charts levels
of attainment in ten broad areas, including Reading, Science,
Mathematics, and Writing Skills. Each area is assessed
periodically, usually once every four or five years. Information is
gathered mainly through the administration of multiple-choice and
open-ended tasks from the target area to subjects selected in the
NAEP sampling design. Demographic and educational background data
are also obtained for each sampled subject. Results are reported
as proportions of correct response to individual items and
clusters of items, for groups of individuals defined by demographic
variables such as age, sex, region of the country, size of
community, and so on.




Comparing NAEP results over time or across age groups
requires measurement on an invariant scale. Proportions of correct
response for a given item may be compared across all the
assessments in which it was administered, but certainly trends in,
say, Mathematics skill are inadequately revealed by performance on
any single item. To overcome the idiosyncrasies of individual
items, information must be combined over several items testing the
same essential skills.


Average percents-correct over clusters of items may instead
be followed, but only as long as the composition of the cluster
does not change. NAEP uses this option at present, but it is
hampered by the fact that typically one fourth of the items in each
assessment are released to the public and retired from the item
pool. Comparisons across assessments of average percents-correct
of clusters of items will become less reliable as the numbers of
common items shrink over the years.

This report explores methods by which modern item
response-curve measurement theory may be applied to the assessment
setting to solve the problems of charting progress over time. The
next section of this chapter reviews the basics of multiple-matrix
sampling theory, the development which has contributed so much to
the success of large-scale assessments such as NAEP to date. It
is upon this sampling framework that measurement models must build
if they are truly to advance the practice of assessment. Next,
the current practices of reporting assessment results are
reviewed. Their limitations and prospects for overcoming them are
discussed. The chapter concludes with a succinct statement of the
objectives of the present research.

Multiple-Matrix Sampling



The accountability movement of the 1960's inspired the
creation of a number of local and statewide testing programs intended
to provide feedback about the effects of public expenditures on
education. The methodology employed in these programs was that of
standardized achievement testing. Every pupil in a school or
classroom was administered an achievement test consisting of as
many as two hundred test items, an undertaking demanding hours or
even days of classroom time from each pupil. Designed to provide
maximal differentiation among students, these tests yield highly
accurate scores for each pupil in but a few broad skill areas.

Averages of pupil-level scores obtained in such a scheme did
indeed reflect levels of performance in the school or classroom,
but in a highly inefficient manner. The administration of
intensive every-pupil testing with traditional achievement tests
suffers several serious deficiencies if it is only the group-level
results that are necessary for discussion by the public and the
educational community. The large numbers of items which must be
administered to a student in a skill area if distinctions are to
be made among students are simply not necessary if only
information about average levels of attainment in the group as a whole
is desired. Such a scheme expends scarce educational resources to
measure each student much more precisely than is required in an
assessment but, by providing results in only a few broadly
conceived skill areas, offers little in the way of specific
guidance for improving the curriculum.



During this same period, sample survey techniques were
becoming a familiar and widely accepted mechanism for gauging the
strength of various attitudes and opinions among the public,
mainly on issues of social or political relevance. Not every
person is interviewed; not every person interviewed is asked all
the same questions. Yet satisfactorily precise and reliable
information is obtained about the prevalence of attitudes in the
public at large. Why not apply these same methods to educational
assessment?

At the request of William Turnbull, president of Educational
Testing Service, Frederic Lord investigated the possibility of
estimating levels of ability in a population by means of
"multiple-matrix sampling," that is, by administering different
subsets of an item domain to different samples of persons (Lord,
1962; Lord and Novick, 1968, Chapter 11).

The simplest application of multiple-matrix sampling is in
estimating the average item score in a population of N subjects
for an item pool of K test items. The average score that would be
obtained by administering every item to every subject can be
approximated by observing the responses of, say, t different
random samples of n subjects each to random samples of k items
each. (This is referred to in the multiple-matrix sampling
literature as a t/k/n design.) The expected value of the average item
score over all such samples is the population average item score.
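
The t/k/n estimator described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions:
the function name matrix_sample_mean, the seed argument, the toy
population, and the choice to draw the t samples of subjects and of
items independently of one another are not taken from the report.

```python
import random


def matrix_sample_mean(responses, t, k, n, seed=0):
    """Estimate the population average item score from a t/k/n design.

    responses[i][j] is the 0/1 score subject i would earn on item j
    (a hypothetical complete N-by-K response matrix).
    """
    rng = random.Random(seed)
    n_subjects = len(responses)      # N
    n_items = len(responses[0])      # K
    total = 0
    for _ in range(t):
        # Assumption: each of the t samples is drawn independently,
        # without replacement within the sample.
        subjects = rng.sample(range(n_subjects), n)
        items = rng.sample(range(n_items), k)
        total += sum(responses[i][j] for i in subjects for j in items)
    return total / (t * k * n)


# Toy population: 4 subjects by 3 items (hypothetical data).
population = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
estimate = matrix_sample_mean(population, t=2, k=1, n=2)
```

When a single sample covers every subject and every item (t=1, k=K,
n=N), the function returns the population average exactly; smaller
designs return an estimate that varies from draw to draw.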



One of the most important results of Lord's investigations
was the conclusion that the estimation of the population average
is most precise for a given number of responses when k=1; that is,
when the responses for different items have been obtained from
non-overlapping samples of subjects. Stated simply, two responses
contain more information about the population if they are from
different persons than if they are from the same person.
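
One way to see why this result is plausible is a simple
variance-components sketch. It is not Lord's derivation. It assumes
that each response is a grand mean plus an independent subject
effect with variance \sigma_s^2 plus an independent residual with
variance \sigma_e^2, and it ignores item sampling and
finite-population corrections. With S subjects each answering k
items, for R = Sk responses in all, the variance of the observed
mean is

```latex
\[
\operatorname{Var}(\bar{X})
  = \frac{\sigma_s^2}{S} + \frac{\sigma_e^2}{S k}
  = \frac{k\,\sigma_s^2 + \sigma_e^2}{R}.
\]
```

Under these assumptions, for a fixed number of responses R the
variance grows with k whenever \sigma_s^2 > 0, so it is smallest
at k = 1.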


Pandey and Carlson (1976), in a study of data from the
California reading assessment, found the effect to be generalizable.
The error variance associated with estimates of the population
mean was reduced almost four-fold when, for the same number of
responses, forms of ten items each were administered to samples of
ten subjects each, as compared to a design under which the items
were administered as two fifty-item forms to ten subjects each.

Practical work generally requires a more complicated sampling
design than those described above. In the California Assessment,
as an example, results must be reported individually for each
school. In the National Assessment, results must be reported for
the cells and the margins of a design based on demographic
variables. Sampling of subjects must therefore be carried out within
levels of stratification in both cases, allowing for the
possibility of different selection probabilities within different cells
to meet requirements for the precision of estimation.

Item pools are generally stratified as well, into divisions of
increasingly narrow skill requirements. The goal is to define
classes of items which are similarly affected by specific
attributes of educational treatments, in order that treatments can be
monitored and modified as a result of the feedback from the
assessment. Our attention in this paper will focus on items within
the finest level of stratification of the item pool, which,
following the practice of the California Assessment, we refer to as a
"skill element."

Reporting Assessment Results

As noted above, comparisons of assessment results over time
and across assessments require measurement on an invariant scale.
The method by which this requirement is achieved in public opinion
survey research is to present subjects with questions that remain
constant over time; by asking, for example, "Which of these
candidates would you vote for if the election were today?" of
subjects interviewed during the six months before the election, one
may chart the flow of public support behind the candidates.
Tyler's (1968) remarks and Womer's (1973) monograph suggest that
this same method was originally intended for reporting the results
of the National Assessment of Educational Progress.

The "fixed-item" approach to reporting the results of
assessments, as it might be called, focuses on comparisons of
performance between groups or across time on a single, specific task.
Interpretation is straightforward as it applies to performance on
that particular item, but the problem lies in generalizing the
results. The 1972/73 NAEP Mathematics Assessment, for example,
presented 200 items to 13-year-olds alone; results for the cells
of a sex by race by size-and-type of community design would have
to be expressed as some ten thousand separate percent-correct
values. Comparisons across groups would vary across items as a
result of measurement error as well as with the skills tapped by
the items. How could such a preponderance of detail be suited to
general public dissemination or discussion?

Educational test items, unlike public opinion survey ques- 
tions, are not usually important in and of themselves, but instead 
as representatives of a class of tasks requiring similar skills. 
It is these generalized skills rather than the specific items that 
are addressed by instruction, and it is this level at which as- 
sessment results must be reported. The technology of educational 
measurement, as it had developed by the early 1970's (see, for 
example, Cronbach et al., 1972), was able to provide a framework
for generalizing results across test items within a skill area: 
the "random-item" model. 

Under the "random-item" approach to reporting the results of
assessments, the specific items from a given skill area are
considered a random sample from a population of items that, taken
together, defines the area. The average item score of a group of
subjects on a randomly selected subset of these items is an
estimate of the group's average for the entire population of items in
the area. Because results are averaged over a number of items,
peculiarities of item formats and distractors tend to cancel out,
revealing trends which underlie performance on all the items in the
skill area. Under this model, average item scores may be compared
across groups and over time even though different sets of items
may have been administered, as long as the set of items
administered in each assessment has been chosen at random from an
invariant population of items.

The assumptions of the "random-item" approach are not,
unfortunately, met in general practice. The problem lies with the
requirement of randomly sampling items from an invariant item pool.
If comparisons are desired across age groups, for example,
subjects must be presented items from the same item pool; the
efficiency of estimation suffers if younger subjects are presented
just as many hard items as are necessary to tap the skills of
older subjects, and older subjects are presented too many easy
items just so the younger subjects can be tested.

A more serious problem is the charting of results over time. 
If item pools remain invariant, they cannot reflect new emphases 
in educational treatments nor can they retire items which have 
outlived their usefulness; neither can items be released to the 
public to aid the interpretation of results without compromising 
the integrity of the measurements. Yet if apparently desirable 
revisions to certain items are carried out, the average item
scores estimated in different assessments are not comparable; they
are estimates of performance in a shifting collection of items, 
perhaps harder or easier on the whole from one year to the next. 
Changes in subjects' skill levels are confounded with changes in 
the composition of the item pool. 

The National Assessment, recognizing this problem, has
responded by reporting results that are to be compared over time in
two different ways.

For non-technical reports slated for public release,
proportions of items correct are reported over all items in a skill
area, despite modifications in the item pool and decidedly
non-random selections of items in an assessment. The comparisons
implied by the figures for different years, although they do not
meet the assumptions required to assure meaningfulness, are
considered useful nonetheless. And indeed they may be good
approximations of the comparisons that would have resulted under ideal
conditions, i.e., true random sampling in both years from a fixed
pool of items.

For scientific investigations, NAEP provides reports and data
tapes based on only those items which appear on all the
assessments to be compared. The assumptions of the "random-item" model
may be met in this way, defining the skill area to be that which
is measured by the average of that specific collection of items.
Comparisons restricted to these so-called "change items" suffer
from the culling of items that cannot be matched, in that
potentially useful information from these items must be ignored. The
resources expended in gathering this information have not been
justified in this respect. Analyses of trends over time are
hampered as well, as the set of items common to all assessments in
question tends to shrink when more time points are considered.

Similar problems in the measurement of individual subjects
have been overcome with the advent of item response-curve (IRC)
measurement models. An IRC model individually parameterizes each
test item in a suitable domain in terms of its relationship to an
underlying scale of ability. Subjects may then be measured on an
invariant scale of attainment, based on their responses to any
subset (not just a randomly selected subset) of items. The
challenge is to apply the methods of IRC theory to the setting of
assessment, borrowing concepts and machinery to free reporting
from the constraints of classical test theory, while at the same
time building upon the multiple-matrix sampling framework.

Problem Statement 

The objective of this research is to further the development
of one approach to applying item response-curve methods to
assessment data; namely, Reiser's (1980) group-effects model,
which (1) allows the estimation of group-level parameters from
item responses obtained in an efficient multiple-matrix sampling
design, (2) yields these parameter estimates on a scale that is
invariant over time and across groups, and (3) permits the
evolution of the item pool over time without degrading the integrity or
the generality of the results. The steps we take to this end are
as follows:

1. Develop an algorithm for linking estimates from the Reiser
model across assessments. The approach will be based on a
proposal by Tucker (1948). Linear transformations are
determined to provide optimal agreement among estimates
from an arbitrary number of assessments, linked by an
arbitrary pattern of common items.

2. Demonstrate the use of the group-effects model and the
linking program with data from the NAEP 1972/73 and 1977/78
Mathematics Assessments. Scales will be linked across
assessment years and across the 13- and 17-year-old age
groups in three skill areas.

CHAPTER II

ITEM RESPONSE-CURVE METHODS FOR ASSESSMENT DATA

This chapter develops an approach for adapting item
response-curve methods to the assessment setting. The first
section is a brief review of item response-curve theory as it has
been developed for measuring individual subjects. Features and
properties of IRC models that will be important in our
generalization to group-level data will be emphasized. The second section
discusses the notion of applying IRC methods to data obtained from
multiple-matrix sampling designs. In particular we consider the
option of defining IRC models at the level of subject groups
rather than individuals. The third section is a non-technical
description of Reiser's group-effects model, an IRC model defined
at the level of groups that is particularly well-suited to the
demographic stratifications used by the National Assessment. The
final section discusses the linking of results from the Reiser
model from one assessment to others, thus providing the continuity
of measurement necessary for longitudinal analyses. (The topics
treated in the last two sections, the Reiser model and the linking
procedures, are treated in a more technical manner in Appendices A
and B respectively.)



Fundamentals of Item Response-Curve Theory 

The models of item response-curve theory differ most
radically from the models of traditional "true-score" psychometrics by
parameterizing test items individually in terms of their
relationships to the underlying ability, rather than treating them as
random samples from a pool of interchangeable items. Once a set of
items has been "calibrated" (i.e., the parameters of the items
have been estimated), a subject's ability can be estimated from
his responses to any subset of the items. This is the case even
when the items he has been presented are only easy ones or only
hard ones, assuming that the IRC model fits the circumstances
reasonably well.

The heart of an IRC model is a mathematical equation for
the probability of a correct response to a particular item by a
particular subject, in terms of one or more parameters that
indicate the subject's ability and one or more parameters describing
how responses to the item are influenced by ability.

To illustrate, we will consider the Birnbaum 2-parameter
logistic item response-curve model, which will be seen to share
many similarities with Reiser's group-effects model for assessment
data. The probability that Subject i will respond correctly to
Item j is given by the following function:

\[
\mathrm{Prob}(X_{ij} = 1) =
  \frac{\exp[1.7 A_j (\theta_i - B_j)]}{1 + \exp[1.7 A_j (\theta_i - B_j)]}
  \tag{1}
\]

where

Xij, the response, is 1 if it is correct and 0 if not,
exp is the exponential function,
θi is the "ability" parameter of Subject i,
Aj is the "slope" parameter of Item j, and
Bj is the "threshold" parameter of Item j.

(The scaling constant 1.7 is included in this expression in order
to make the item parameters in the Birnbaum logistic model match
more closely the item parameters in the normal ogive model.)

The function shown above describes how likely it is that a
subject with a given ability will respond correctly to Item j.
This function can be graphed, as in Figure 1: the item response
curve for Item j. It may be seen that subjects with very low
values of θ have little chance of responding correctly. As θ
increases, so do chances of responding correctly. For a subject
with an ability that has the value Bj (the threshold of Item j),
the chances of a correct response are 50-50. As ability continues
to increase, chances of responding correctly increase also, until
a correct response is nearly a certainty at very high levels of
ability.
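
The behavior of the item response curve just described can be
checked with a short Python sketch of Equation 1. The function name
irc_probability and the item values used in the loop are
illustrative assumptions, not part of the report.

```python
import math


def irc_probability(theta, slope, threshold, scale=1.7):
    """Probability of a correct response under Equation 1.

    theta is the subject's ability, slope is A_j, threshold is B_j.
    exp(z) / (1 + exp(z)) is written in the equivalent form
    1 / (1 + exp(-z)).
    """
    z = scale * slope * (theta - threshold)
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (hypothetical) item: slope 1.0, threshold 0.5.
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(irc_probability(theta, 1.0, 0.5), 3))
```

At an ability equal to the threshold the function returns exactly
0.5, and the returned probability rises with ability on either side
of that point.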

When this model is fitted to data, it is capable of
accounting for the facts that:

(1) Some subjects perform better than others on the items
in the skill area.

(2) Some items in the area are easier than others.

(3) Some items are more reliable indicators of the ability
than others.

Figure 2 shows three different item response curves on the same
plot. It may be seen that, on the average, Item 3 is harder than
Item 2, which is harder than Item 1. It may also be seen that the
higher the value of an item's slope, the more sensitively an item
reacts to changes in subject ability. Item 2 is more informative
than Item 1, which in turn is more informative than Item 3.

The manner in which subject and item parameters combine to
produce probabilities of correct response is illustrated in Tables
1 and 2. Table 1 shows, for four hypothetical items and six
hypothetical subjects, the quantity 1.7 Aj (θi - Bj). The orderly
relationships among the parameter values are most clear in this
chart, showing what are called the "logits" of correct response.
Table 2 transforms these logits to the more familiar units of
probabilities via Equation 1.
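
The two-step computation behind Tables 1 and 2 can be sketched as
follows. The item thresholds and scaled slopes are the values that
appear to be printed in the heading of Table 1, and the first
ability (-0.5) is that of Subject 1; the other abilities and the
function names are assumptions made for illustration.

```python
import math

# Item values as they appear in Table 1: thresholds B_j and
# scaled slopes 1.7 * A_j for four items.
thresholds = [-2.0, -1.0, 1.0, 2.0]
scaled_slopes = [1.0, 0.5, 0.5, 1.0]

# Subject 1's ability (-0.5) is from Table 1; the others are hypothetical.
abilities = [-0.5, 0.0, 1.5]


def logit_row(theta):
    """Logits 1.7 * A_j * (theta - B_j) for one subject across all items."""
    return [s * (theta - b) for s, b in zip(scaled_slopes, thresholds)]


def probability_row(theta):
    """Apply Equation 1 to each logit to obtain probabilities."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logit_row(theta)]


for theta in abilities:
    print(theta, logit_row(theta),
          [round(p, 3) for p in probability_row(theta)])
```

For Subject 1 the logits come out as 1.5, 0.25, -0.75, and -2.5, so
the probabilities fall steadily from the easiest item to the
hardest.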

FIGURE 1
AN ITEM CHARACTERISTIC CURVE
[figure not reproduced]

TABLE 1
LOGITS OF EXPECTED PROPORTIONS CORRECT

                          ITEMS
            Bj:       -2.000   -1.000    1.000    2.000
            1.7 Aj:    1.000    0.500    0.500    1.000
SUBJECT     θi
1           -0.500
[cell entries and remaining rows not legible]

It may be noted at this point that the units for the subject
and item parameters are unique only up to a linear transformation.
That is, equivalent relationships may be expressed by transforming
all the parameters by the linear function f(x) = mx + b as follows:

\[
\theta_i^* = m\,\theta_i + b, \qquad
A_j^* = A_j / m, \qquad
B_j^* = m\,B_j + b.
\]

It may be readily verified that these transformed subject and item
parameters yield exactly the same probabilities of correct
response as the originals, since

\[
1.7 A_j^* (\theta_i^* - B_j^*) = 1.7 A_j (\theta_i - B_j).
\]

The implication is that when parameters are estimated for the same
set of items from different sets of data, they can be expected to
differ by such a linear transformation. It is a practical
problem to estimate the optimal transformation; various solutions and
proposals have been made by Tucker (1948), Lord and Novick (1968),
and Haebara (1981).
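
Both points can be illustrated numerically. The sketch below first
checks that rescaling ability and threshold by the same function
f(x) = mx + b, while dividing the slope by m, leaves the logit
unchanged. It then recovers m and b from two sets of threshold
estimates by matching their means and standard deviations, a simple
rule often called the mean and sigma method. That rule is used here
only as a stand-in; it is not the procedure of any of the authors
cited above, and all names and numeric values are hypothetical.

```python
import statistics


def logit(theta, slope, threshold, scale=1.7):
    """The quantity 1.7 * A_j * (theta_i - B_j)."""
    return scale * slope * (theta - threshold)


def rescale(theta, slope, threshold, m, b):
    """Apply f(x) = m*x + b to ability and threshold; divide slope by m."""
    return m * theta + b, slope / m, m * threshold + b


def mean_sigma_link(thresholds_from, thresholds_to):
    """Return (m, b) mapping one set of threshold estimates onto another."""
    m = statistics.pstdev(thresholds_to) / statistics.pstdev(thresholds_from)
    b = statistics.mean(thresholds_to) - m * statistics.mean(thresholds_from)
    return m, b


# Hypothetical subject and item, rescaled with m = 2 and b = 3.
theta, slope, threshold = 0.4, 0.8, -1.0
new_theta, new_slope, new_threshold = rescale(theta, slope, threshold,
                                              m=2.0, b=3.0)
same = abs(logit(theta, slope, threshold)
           - logit(new_theta, new_slope, new_threshold)) < 1e-12

# Thresholds for the same items expressed on two scales.
old_scale = [-2.0, -1.0, 1.0, 2.0]
new_scale = [2.0 * x + 3.0 for x in old_scale]
m, b = mean_sigma_link(old_scale, new_scale)
```

With error-free thresholds the rule returns the transformation that
generated them; with real estimates it gives only an approximate
answer, which is why the choice of linking method matters in
practice.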

Benefits of Item Response Curve Models 

When a set of data is adequately summarized in terms of an item
response-curve model like the one described above, several
benefits accrue:

Invariance with respect to item selection. Once a collection
of items has been calibrated (i.e., the parameters of the items
have been estimated), a subject's ability may be estimated on the
basis of his responses to any subset of the items, randomly
selected or not. This means that, as an example, younger students
may be administered mostly easy items while older students are
administered mostly difficult items from the same scale.

New items can be added to the domain. New items measuring
the same trait can be linked into an existing scale by
administering them along with items that have already been calibrated. The
new items can be calibrated from these data, and their fit to the
model verified, before they are used to estimate subjects'
abilities.



Flawed items can be corrected. Items found to have flaws in
their grammar, format, or conception can be revised, then
re-calibrated into the domain as if they were new items.

Items can be dropped from the bank. Without affecting the
scale of measurement, items may be retired from use, either
because they are outdated or because they will be used to
illustrate the content area in reports released to the public.

Content-referencing of scores. The scaled scores (θ) from
IRC theory are defined implicitly by the probabilities of correct
response they imply for each of the items in the skill area. The
meaning of an ability estimate can be interpreted, therefore, by
inspecting the content of the items with thresholds in that region
of the scale, without reference to the distribution of ability in
any population of subjects. Scale scores may still be interpreted
in the more familiar manner of norm-referencing, of course, with
the computation of percentiles, stanines, standard scores, and so
on, with regard to specified populations of subjects.

Linearity with external variables. Because IRC ability
estimates are not subject to the floor and ceiling effects of
number-right and percent-correct scores, they tend to have more
linear relationships with external variables such as SES, age, and
years of education.

Well-defined standard errors. Because items are large-sample
calibrated, they are considered 'fixed' rather than 'random' from
a statistical point of view. For this reason, standard errors of
estimating subjects' abilities and, indirectly, reliabilities
are easy to compute (see Lord & Novick, 1968; Mislevy, 1981).
Moreover, these standard errors are correctly expressed as a
function of the ability itself rather than gratuitously and
erroneously assumed constant, as in classical theory for
number-right scores.
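
The dependence of the standard error on ability can be sketched as follows. This is a minimal illustration, assuming a two-parameter logistic IRC with the conventional scaling constant 1.7 and the usual large-sample result that the standard error is the reciprocal square root of the test information; the five item parameter pairs are hypothetical and are not taken from any NAEP scale.

```python
import math

D = 1.7  # conventional scaling constant for the logistic IRC


def p_correct(theta, a, b):
    """Two-parameter logistic IRC: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def standard_error(theta, items):
    """Large-sample standard error of an ability estimate at theta.

    items is a list of (slope, threshold) pairs treated as fixed, known values.
    """
    info = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        info += (D * a) ** 2 * p * (1.0 - p)  # information from one item
    return 1.0 / math.sqrt(info)


# Hypothetical five-item test; these values are illustrative only.
items = [(1.0, -1.0), (1.0, -0.5), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]
for theta in (-3.0, 0.0, 3.0):
    print(f"theta = {theta:+.1f}  SE = {standard_error(theta, items):.3f}")
```

Under these assumptions the standard error is smallest where the item thresholds are concentrated and grows toward the extremes of the scale, which is the behavior a single constant standard error cannot represent.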

Suitability for longitudinal studies. With the use of
alternate test forms consisting of items from the same scale, IRC
ability estimates are amenable to the study of trends or program
effects.

Assumptions of Item Response Curve Models

If an IRC model is to fit data, the assumptions of the model
must be reasonably well satisfied. The main assumptions are
discussed below.

Unidimensionality. Nearly all applied work uses IRC models
that assume a single underlying ability scale. This means that
subjects' differing probabilities of correct response, with
respect to each of the items in the scale, can be described by a
single variable. If one subject's probability of a correct
response to a given item is higher than that of a second subject,
the assumption of unidimensionality implies that the first subject
has higher probabilities of correct response than the second
subject on all the items in the scale.

Local (conditional) independence. A subject's response to a
given item is assumed to depend only on his level of ability, not
on extraneous factors such as the position of the item on the test
or his responses and reactions to preceding items.

Temporal stability. Item parameters and, equivalently,
relationships among items must remain stable over time to
guarantee the comparability of ability estimates over time.

Goodness-of-fit. The item and person parameters of the IRC
model must accurately account for the probability of a correct
response to any item from any subject who is to be measured. This
is equivalent to saying that the item parameters and the scale of
ability they imply must be invariant over subjects. (Experience
has shown that the more homogeneous the content of the items, the
more likely it is that this assumption will be satisfied.)

Satisfying these assumptions is more of a skill than a
science. During the past decade, practitioners have begun to build
up the body of experience necessary to apply item response curve
theory at the level of measuring individuals. Still, questions
remain concerning topics such as the range of ability over which
item parameters can retain the same values and the possibility
that item parameters may 'drift' over time.

With the exception of the California Assessment Program,
there has been little experience to date with the problem of
meeting the assumptions of IRC models in the context of the sparse
(at the level of individuals) samples of item responses gathered
in efficient multiple-matrix designs. Guidelines derived from
that program's experience will be discussed in the following
section and employed in the examples in the following chapter.

Application to Multiple-Matrix Samples of Responses

Clearly, the advantages of item response curve theory offer
considerable benefit to educational assessment. Not only can the
restriction of a fixed item bank from which items must be drawn at
random be lifted, but results can be reported on a
content-referenced scale: estimates of levels of attainment can be
interpreted in terms of probabilities of correct response to the
items whose thresholds define a scale.

The main obstacle to the application of IRC theory to the
assessment setting is, ironically, the efficient design of the
multiple-matrix samples. IRC theory, as presently conceived, has
been designed for the measurement of individual subjects. To
estimate the ability of an individual subject with IRC methods,
several items from the scale must be administered to him. This
practice is at odds with the aim of multiple-matrix sampling,
which provides economical information about groups by eliminating
the measurement of individual subjects.

There are three approaches by which the technology of IRC
theory can be applied to multiple-matrix samples of responses.
The following paragraphs consider each in turn.



Subject-level model, subject-level estimates. The first
approach by which IRC methods may be applied to multiple-matrix
samples of item responses employs an IRC model like those described
in the previous section, modelling the probabilities of correct
response of individual subjects. Each subject sampled for a given
skill area is administered enough items from that skill area to
permit the estimation of his ability. The resulting estimates of
the abilities of individual subjects may then be averaged over
subpopulations as desired.

Several benefits of IRC theory may be enjoyed under 
this approach. First, the necessary computational 
methods are available, having been developed over the past decade 
for use in measuring individuals. Second, the restrictions on the 
item bank are relieved; items could be dropped from the bank or 
new ones could be calibrated in. Third, content-referencing of 
score estimates is possible. And fourth, random selection of 
items for test forms is no longer required; harder and easier
forms could be developed and administered appropriately, so that
the items an individual takes may be more informative about him
and, consequently, about the subpopulations to which he belongs. 

As noted above, however, this approach requires that each
subject be administered several items from a skill area, perhaps
as many as fifteen or twenty, proscribing the use of the most
efficient multiple-matrix designs. (Pandey and Carlson [1976]
demonstrate a 380-percent increase in efficiency for estimating a
group average using ten-item forms as compared to fifty-item
forms, observing the same number of responses in both designs.)
If the application of IRC theory is truly to advance the state of
the art of assessment, it must build upon the advances already
gained through the use of efficient sampling designs rather than
discard them.

Subject-level model, group-level estimates. A second
approach is to define an IRC model at the level of individual
subjects but to estimate the parameters of the distributions of
ability in subpopulations directly, without estimating the
abilities of individual subjects.

This approach is well conceived for application to the most
efficient multiple-matrix designs. Each sampled subject in the
assessment need be administered only one or two items in any
skill area. It is possible to estimate the parameters of the
distributions of ability in any subpopulation on the basis of the
responses of the subjects sampled from that subpopulation, and to
estimate relationships among skill areas or between skills and
external measured variables, all without estimating the ability of
any individual subject.

Efficient methods of estimation under this approach are still
under development. The rudiments of one method are found in
Andersen and Madsen (1977) and Sanathanan and Blumenthal (1978),
who discuss the estimation of population parameters from
subject-level data under the restrictions that all subjects have
been administered the same set of items and the IRC model is the
one-parameter logistic. Extensions to the general IRC case and to
multiple-matrix data are given by Mislevy (1982).

Group-level model, group-level estimates. The third
approach, the one upon which the present research is based, defines
an item response curve model at the level of subject groups rather
than at the level of individual subjects. One or more item
parameters still relate each item in a skill area to an underlying
scale of attainment, but the ability (or attainment) parameters
are for groups of subjects rather than individuals. Rather than
modelling the probabilities of correct responses from specific
individuals, a group-level IRC expresses the probability of a
correct response to a particular item from a subject selected at
random from groups at the various levels of attainment. A group
ability parameter may thus be interpreted as the average over the
subjects in that group.

A group-level IRC would be defined at the lowest level of
stratification of the population of subjects for which results are
to be studied or reported. In the California Assessment, for
example, an IRC is defined at the level of schools; school-level
score estimates are then averaged to the levels of districts, Los
Angeles areas, counties, and the state as a whole when desired.

The sampling scheme upon which such an IRC model is based is
the most efficient multiple-matrix sampling design, in which each
sampled subject responds to at most one item from any given skill
area. Under this design and in a group-level model, the responses
of the individual subjects from a given group may be considered
independent, given the ability parameter of the group. In this
way the IRC assumption of local independence is satisfied.

Group-level IRC's can be justified in two different ways.
First, they may be seen simply as models for data analysis which
may be used profitably when they are able, with their item and
group parameters, to describe the matrix of item-by-group
proportions of correct response in the skill area under
consideration. In this sense they are a generalization of logistic
models for the analysis of binary data (Cox, 1970; Bock, 1975), in
which any interaction terms between "item" factors and
"subject-group" factors are constrained to follow patterns
describable as item response curves with possibly different slope
parameters. Second, group-level IRC's may be seen as an integration
over group distributions of phenomena described by subject-level
IRC's. Under this interpretation, the distributions in all groups
are assumed identical in shape, and may differ only as to location.

In two special cases a simple relationship exists between
item parameters from the group-level IRC and those from a
subject-level IRC. (1) The distributions of ability within groups
may be considered to be concentrated on a single point (Bock,
1976), in which case the item parameters in the group-level IRC
would be identical to those in the subject-level IRC. This case
assumes that the grouping of subjects accounts for all systematic
variation among them. (2) The distributions of abilities may be
assumed normal within groups; if a normal ogive subject-level
IRC is also assumed, the group-level item threshold and slope
parameters are functions of the subject-level item parameters and
the common dispersion of ability within groups (Mislevy, 1982).
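
Case (2) can be checked numerically. The sketch below assumes a subject-level normal ogive P(correct | θ) = Φ(a(θ - b)) and abilities distributed Normal(μ, σ²) within a group. Under those assumptions the standard normal-convolution identity gives a group-level ogive with the same threshold b and an attenuated slope a / sqrt(1 + a²σ²). Whether this matches the exact parameterization used by Mislevy (1982) is an assumption, and all function names and numerical values here are illustrative.

```python
import math


def phi_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def group_irc_closed_form(mu, a, b, sigma):
    """Group-level normal-ogive IRC implied by a subject-level ogive."""
    a_group = a / math.sqrt(1.0 + (a * sigma) ** 2)  # attenuated slope
    return phi_cdf(a_group * (mu - b))               # threshold is unchanged


def group_irc_numeric(mu, a, b, sigma, n=4001, width=8.0):
    """The same quantity by trapezoidal integration over the group's abilities."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        density = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        density /= sigma * math.sqrt(2.0 * math.pi)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * phi_cdf(a * (theta - b)) * density * step
    return total


# Hypothetical values: group mean 0.3, item slope 1.2, threshold -0.5,
# within-group standard deviation 0.8.
print(group_irc_closed_form(0.3, 1.2, -0.5, 0.8))
print(group_irc_numeric(0.3, 1.2, -0.5, 0.8))
```

Letting sigma approach zero in the closed form recovers case (1), in which the group-level and subject-level item parameters coincide.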

Defining Skill Areas for Assessment 

Comparing attainment over time or across subpopulations
requires item parameters that remain stable over time and across
subject groups. This requirement is most easily achieved in the
setting of individual measurement by scaling within skill areas
defined narrowly rather than broadly.

This prescription can be a burden in the setting of
individual measurement, since it implies that each individual to be
measured must be administered several items from each of several
skill areas, defined narrowly to guarantee stable item parameters
but highly correlated in the population of individuals.



The same prescription can be a boon in the setting of
assessment. In efficient approaches of IRC application to
assessment, each subject will be administered only a few items from
each separately scaled skill area. This means that the number of
skill areas which can be measured at the level of groups can be
very large, without requiring excessive time for administration.
The Grade 3 California Assessment, for example, measures
school-level attainment in 61 skill elements, while requiring less
than an hour from each subject. (The items from these skill
elements are distributed among thirty different test forms, each of
which contains thirty-four items from different elements; each
sampled subject is administered one randomly selected test form.)

The manner in which skill elements are selected for separate 
scaling is based on the requirement for stable relationships among 
items' relative difficulties. The California Assessment Program 
has attempted to define skill elements in terms of educational
practice: if skill elements are based on "indivisible curricular 
elements", all items in a scale will be similarly affected by 
curricular change. Changes in a school's performance over time, 
then, may appear as increases in one element and decreases in 
another, but will be consistent with respect to all the items 
within an element. Because progress (or lack of progress) can be 
monitored at the level at which educational treatments are ap- 
plied, CAP results help school officials adjust the balance of 
emphases of various components of the curriculum. 

Reiser's Model for Group Effects 

The National Assessment of Educational Progress (NAEP)
employs sampling at both the item and the subject level. The
assessment instrument in a given content area is constructed of
items sampling specified objectives and assigned to one of a number
of forms, with the number of forms varying across age levels and
content areas in accordance with the number of objectives in which
performance is to be measured. Each such form is administered to
a national probability sample of approximately 2500 persons,
selected by the cluster method from the appropriate age group (9-,
13-, or 17-year olds). Pupils, designated by age rather than
grade, are tested by NAEP personnel outside the classroom. All
pupils in any one testing session are administered the same form,
and are "paced" through it by a tape recording that determines the
amount of time spent on each item. Free-response as well as
multiple-choice items are used.

The results of these tests are aggregated not to the level of
schools, as in the California Assessment, but to the cells of a
multi-way demographic classification of subjects. Ways of
classification include age, sex, racial/ethnic group, size and type
of community, region of the country, and parental education. The
emphasis is on measuring progress (change) in attainment of the
objectives as seen in the population as a whole and in the
subpopulations defined by the demographic classifications.
Typically each assessment deals with one or more content areas,
each of which is assessed every four or five years.

At present NAEP does not make use of any type of scale-score
reporting. Results are expressed as percents-correct for items
slated for public release, or as average percents-correct over the
items in an objective or a content area. Because of the
difficulties mentioned above with the interpretation of these
averages when the item pool changes from year to year, NAEP could
make use of the item-invariant scales offered by IRC models.

In dissertation research, Reiser (1980) generalized a model
by Bock (1976) to provide a group-level IRC appropriate to the
aims and the current practices of NAEP assessments. The Reiser
model is based on the following assumptions:

1. Each objective is scaled separately. The items within an
   objective are considered sufficiently homogeneous to
   function as what is referred to in the California Assessment
   as an "indivisible curricular unit." That is, differences
   between subpopulations and changes over time may differ
   across objectives, but are essentially the same for all
   the items measuring a given objective.

2. Each item representing an objective will appear on a
   different test form. (In present NAEP assessments, this
   assumption is not strictly satisfied; occasionally two or
   three items from the same objective appear on the same
   form.)

3. The distributions of ability within each of the cells of
   the classification scheme have the same shape, differing
   at most by location (i.e., cell average levels of
   attainment). The demographic classification is assumed to
   absorb all variation between schools or other levels of
   clustering in the sampling design.

4. The expected proportion of correct responses to an item
   within the ultimate subclasses of the demographic classes
   is a two-parameter logistic function of the parameters of
   the item and demographic effects for that cell.

An Example 

Perhaps the best way to introduce Reiser's model is with a
(relatively) simple numerical example. We begin by saying that
the form of the model is very similar to that of the 2-parameter
logistic IRC model described above, except that the focus is not
on explaining the probability of a correct response from a
specified subject, but from a subject selected at random from a
specified group (i.e., a cell from the demographic classification
scheme).

Consider four items from a skill objective and the six cells
of a sex-by-age design, including ages 9, 13, and 17. Assume that
pilot-testing of the items has indicated their relative
difficulties. To each age group, only the two items at the
appropriate level of difficulty have been administered: Items 1 and
2 to 9-year olds, Items 2 and 3 to 13-year olds, and Items 3 and 4
to 17-year olds. For each item targeted for a given age level,
random samples of subjects from each sex are administered the item.
All sex-by-age-by-item samples are equal in size. Suppose that the
proportions of correct response observed in this administration are
as shown in Table 3.

Now the comparison of average percents-correct over all the
items taken suggests a decline in attainment as age increases,
from .610 to .500 to .317. This result is clearly an artifact of
the design of administration. It is obvious in this example that
average percents-correct cannot be compared across sets of items
that differ in difficulty.

One alternative is to compare age groups on the basis of the
items that they have taken in common. Item 2, for example, shows
.500 correct for 9-year olds and .621 correct for 13-year olds;
Item 3 shows .380 correct for 13-year olds and .439 for 17-year
olds. These comparisons, illustrating increasing levels of
performance with increasing age, are valid but inefficient; each is
based on only half the data available from the age groups being
compared. Moreover, no such comparison can be made between 9- and
17-year olds, because they have taken no items in common.

The first step in understanding Reiser's model is to consider
the logits of these proportions, as shown in Table 4. The model
attempts to explain these values as functions of item parameters
B_j (threshold) and A_j (slope), and cell average attainment
(\theta_{kl}, where k designates sex and l designates age). The
form of the model is as follows:

    L_{jkl} = 1.7 A_j (\theta_{kl} - B_j),

where L_{jkl} represents the logit of the proportion of correct
responses to Item j from the cell with sex designation k and age
designation l.

TABLE 3
OBSERVED PROPORTIONS CORRECT

TABLE 4
LOGITS OF EXPECTED PROPORTIONS CORRECT

In terms of proportions correct, the logits are transformed as
follows:

    P_{jkl} = \frac{\exp(L_{jkl})}{1 + \exp(L_{jkl})}.    (2)



From the observed logits of correct response, item and group
parameters must be estimated. In Reiser's model, as with all
2-parameter logistic IRC models, there are two linear dependencies
that must be resolved arbitrarily; it is this fact that permits
all parameters to be rescaled by a linear transformation as
discussed above. In this example we resolve them by restricting the
average of the thresholds of the items to be zero and the distance
between the highest and lowest thresholds to be four. Under these
constraints, the estimates of the item and group parameters are as
follows:



Item    Threshold    Slope

  1       -2.00       1.00
  2       -1.00       0.50
  3        1.00       0.50
  4        2.00       1.00


Age     Sex     Ability

  9      F       -0.50
  9      M       -1.50
 13      F        0.50
 13      M       -0.50
 17      F        1.00
 17      M        0.00







As befits an artificial example, these estimates perfectly
account for the observed proportions of correct response as shown
in Table 3, when combined via Equation 2. An examination of the
ability, or scale score, values for the demographic groups shows a
clear increase in levels of attainment with increasing age, from
-1.00 to 0.00 to 0.50 with males and females averaged in each age
group. Moreover, this pattern accounts for the differences
between ages for all items. The comparison is thus based on all the
observations.
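
The arithmetic of the example can be retraced with a short sketch. It takes the item thresholds and slopes and the six cell abilities listed above, applies the logistic transformation of Equation 2, and averages the two sexes with equal weight, as the equal sample sizes imply. One point is an inference from the arithmetic rather than something stated in the report: the proportions quoted in the text (.500 and .621 for Item 2, .380 and .439 for Item 3, and the age averages .610, .500, and .317) are reproduced to within rounding when the scaling constant is taken as 1.0 rather than 1.7, so the hypothetical argument d defaults to 1.0 here.

```python
import math

# Item parameters (threshold B, slope A) and cell abilities from the example.
items = {1: (-2.0, 1.0), 2: (-1.0, 0.5), 3: (1.0, 0.5), 4: (2.0, 1.0)}
ability = {(9, "F"): -0.5, (9, "M"): -1.5,
           (13, "F"): 0.5, (13, "M"): -0.5,
           (17, "F"): 1.0, (17, "M"): 0.0}
administered = {9: (1, 2), 13: (2, 3), 17: (3, 4)}  # items given to each age


def proportion(item, cell, d=1.0):
    """Expected proportion correct: logistic transform of d * A * (theta - B)."""
    b, a = items[item]
    logit = d * a * (ability[cell] - b)
    return math.exp(logit) / (1.0 + math.exp(logit))


def age_item_average(age, item, d=1.0):
    """Proportion correct on one item with the two sexes averaged (equal n)."""
    return (proportion(item, (age, "F"), d) + proportion(item, (age, "M"), d)) / 2.0


def age_average(age, d=1.0):
    """Average proportion correct over the items administered to an age group."""
    return sum(age_item_average(age, j, d) for j in administered[age]) / 2.0


for age in (9, 13, 17):
    per_item = [round(age_item_average(age, j), 3) for j in administered[age]]
    print(age, per_item, round(age_average(age), 3))
```

Calling the same functions with d=1.7 gives noticeably different proportions, which is why the default was chosen as it was.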

The second step in understanding the Reiser model requires a
closer look at the scale scores of the six sex-by-age cells. As
noted above, score averages for age groups with sexes combined are
-1.00, 0.00, and 0.50. Score averages for sex groups with age
groups combined are 0.50 for females and -0.50 for males.
Together these age and sex marginal effects account for each of
the individual cells; that is, there is no sex-by-age interaction.
To obtain the scale score of any cell, three steps are required:

1. Start with an initial approximation of 0.00. 

2. To account for the age effect, subtract 1.00 if the cell 
is for 9-year olds and add 0.50 if it is for 17-year olds. 

3. To account for the sex effect, subtract 0.50 if the cell
is for males and add 0.50 if it is for females.
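
The three steps above can be sketched as a small additive calculation. The effect values are those given in the steps, and the cell scores they are checked against are the ability estimates listed earlier for the six cells; the dictionary and function names are illustrative.

```python
# Age and sex effects as stated in the three steps.
AGE_EFFECT = {9: -1.00, 13: 0.00, 17: 0.50}
SEX_EFFECT = {"F": 0.50, "M": -0.50}


def cell_score(age, sex, baseline=0.00):
    """Scale score of a cell as baseline plus age effect plus sex effect."""
    return baseline + AGE_EFFECT[age] + SEX_EFFECT[sex]


# Cell abilities listed in the example, for comparison.
listed = {(9, "F"): -0.50, (9, "M"): -1.50,
          (13, "F"): 0.50, (13, "M"): -0.50,
          (17, "F"): 1.00, (17, "M"): 0.00}

for (age, sex), value in listed.items():
    print(age, sex, cell_score(age, sex), value)
```

Six cell scores are thus carried by three free effects (two age contrasts and one sex contrast) around a fixed baseline, which is the economy the model exploits in larger designs.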


A distinguishing feature of Reiser's model is that the levels
of ability in the ultimate subgroups in the design need not be
estimated individually, but may be expressed as functions of some
smaller number of effects related to the ways of classification.
Statistical tests for the presence of effects, both main and
interaction, are easily obtained by comparing how well various
nested models explain the observed proportions of correct item
responses across the cells of the design.

Reiser's dissertation research, as an example, used a
classification scheme based on sex, race, and size-and-type of
community (STOC). The analysis concerned Skill in Computing
Fractions, with data from the 1977/78 assessment of 13-year olds.
He found that the variation among the attainment levels of the
cells in this 2-by-3-by-7 design could be explained in terms of
just main effects for the three variables and race-by-sex
interaction.

The parameters of Reiser's model may be estimated by the
method of maximum likelihood. An equation like Equation 2 above
expresses the probability of a correct response to a given item
from a given cell in the design. The product of these expressions
over all the items and cells, appropriately weighted to reflect
the numbers of attempts each observed proportion represents, is
the probability of the entire data set, as a function of item and
group-effect parameters. Item and group-effect parameters are
then found that maximize this probability. (See Appendix A for a
more technical description of the model and the estimation
procedures.)
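
The likelihood just described can be written down compactly. The sketch below is a minimal illustration and is not the estimation procedure of Appendix A: it assumes a binomial likelihood for each item-by-cell count, works with the logarithm of the product, and uses hypothetical item parameters, cell abilities, and sample sizes throughout. It only evaluates the function; the search for maximizing values is left out.

```python
import math


def cell_probability(a, b, theta, d=1.7):
    """Probability of a correct response to an item from a cell."""
    return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))


def log_likelihood(counts, items, abilities, d=1.7):
    """Binomial log-likelihood of the data, up to an additive constant.

    counts maps (item, cell) -> (number of attempts, number correct);
    items maps item -> (slope, threshold); abilities maps cell -> theta.
    """
    total = 0.0
    for (item, cell), (n, r) in counts.items():
        a, b = items[item]
        p = cell_probability(a, b, abilities[cell], d)
        total += r * math.log(p) + (n - r) * math.log(1.0 - p)
    return total


# Hypothetical two-item, two-cell data generated from known parameters.
items = {1: (1.0, -0.5), 2: (0.8, 0.5)}
true_abilities = {"cell_a": -0.3, "cell_b": 0.4}
counts = {}
for item, (a, b) in items.items():
    for cell, theta in true_abilities.items():
        n = 200
        counts[(item, cell)] = (n, round(n * cell_probability(a, b, theta)))

shifted = {cell: theta + 0.5 for cell, theta in true_abilities.items()}
print(log_likelihood(counts, items, true_abilities))
print(log_likelihood(counts, items, shifted))
```

Weighting by the numbers of attempts enters through the counts n and r: a proportion observed on many attempts contributes more to the total than one observed on few.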

Linking Results Across Assessments 

The Reiser model outlined above has the capacity for
analyzing multiple-matrix samples of item responses with the
item-invariance properties that distinguish the IRC approach.
Previous use of the model (Reiser's dissertation) considered data
from one time point only, considering just the proportions of
correct response to all items in an objective as observed in the
cells of a demographic classification of subjects. But charting
results over time is the raison d'etre of assessment; a capability
for linking the results of assessments from different points in
time is essential to any method of analyzing such data.

In principle it is possible to analyze simultaneously data
from several points in time with the Reiser model. All that is
necessary is the (possibly incomplete) matrix of proportions of
correct response to the items in the objective in question, from
each cell in the demographic classification of subjects, at each
point in time. The analysis proceeds as described in the previous
section, except that the effects which constitute constraints on
modelled cell probabilities now include a main effect for time
and, if desired, interactions of time and demographic effects
(i.e., allowing for the measurement of differential progress in
different subpopulations).

This approach has in fact been carried out in the present
study, with data for two points in time within a single age group.
The geometric increase in the number of item-by-group cells as
additional time points are considered, however, leads to an
exponential increase in the computing resources necessary to
estimate the parameters in the model. Clearly this approach is not
well suited to longitudinal analyses of any complexity.

A more manageable approach is to estimate the item and
group-effect parameters from each point in time separately, then
link the results on the basis of items that are common across time
points. If the assumptions of the model are correct, the item
parameters for the linking items in two assessments should differ
by only a linear transformation:

    A_j^* = A_j / m

    B_j^* = m B_j + b,

where the linear transformation f(x) = mx + b translates the item
parameters from the second point in time to the base scale. The
same transformation is then applied to the ability estimates of
the subject groups and group effects. It is necessary, then, to
be able to estimate values of m and b which will make each
item's response lines from the two time points match most closely
after the item slopes and thresholds from the second time point
are appropriately transformed.
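
That the transformation leaves an item's response line unchanged can be verified directly. The sketch below assumes the form f(x) = mx + b given above, with thresholds and abilities mapped by f and slopes divided by m; the linking constants and the Time II values are hypothetical.

```python
def rescale_item(slope, threshold, m, b):
    """Move an item's slope and threshold onto the base scale."""
    return slope / m, m * threshold + b


def rescale_ability(theta, m, b):
    """Move a group ability (or group effect) onto the base scale."""
    return m * theta + b


def response_line(slope, threshold, theta, d=1.7):
    """Logit of a correct response: d * A * (theta - B)."""
    return d * slope * (theta - threshold)


m, b = 1.25, 0.30                          # hypothetical linking constants
slope, threshold, theta = 0.8, -0.4, 0.6   # hypothetical Time II values

new_slope, new_threshold = rescale_item(slope, threshold, m, b)
new_theta = rescale_ability(theta, m, b)

before = response_line(slope, threshold, theta)
after = response_line(new_slope, new_threshold, new_theta)
print(before, after)
```

Because (A/m)((mθ + b) - (mB + b)) = A(θ - B), the modelled proportions are untouched; only the numerical labels on the scale change.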

Methods of estimating m and b have been proposed by Tucker
(1948), Lord and Novick (1968), and Haebara (1981). One simple
approach is to calculate the mean and the standard deviation of
the item thresholds at both points in time, then choose m and b so
that the mean and standard deviation of the rescaled Time II
thresholds match the corresponding values from Time I. That is,

    m = S(I) / S(II)

    b = -[S(I) / S(II)] X(II) + X(I),

where S(k) denotes the standard deviation of the item thresholds at
Time k and X(k) denotes their mean.
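
A minimal sketch of this mean-and-standard-deviation rule follows. The threshold values are hypothetical: the Time II values are constructed so that the Time I values equal 1.25 times them plus 0.30, which lets the recovered constants be checked by eye. The population form of the standard deviation is used; the divisor cancels in the ratio, so the sample form would give the same m.

```python
import statistics


def mean_sigma_link(thresholds_time1, thresholds_time2):
    """Return (m, b) so that m * B(II) + b matches Time I in mean and SD."""
    m = statistics.pstdev(thresholds_time1) / statistics.pstdev(thresholds_time2)
    b = statistics.mean(thresholds_time1) - m * statistics.mean(thresholds_time2)
    return m, b


# Hypothetical thresholds for four linking items at the two time points.
time1 = [-1.0, 0.0, 0.5, 1.5]
time2 = [(t - 0.30) / 1.25 for t in time1]

m, b = mean_sigma_link(time1, time2)
rescaled = [m * t + b for t in time2]
print(m, b)
print(rescaled)
```

With real estimates the two sets of thresholds would not be exact linear functions of one another, and the rule would match only their first two moments.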

This simple procedure does not take into account the fact
that some item parameters may be estimated more accurately than
others, either because more subjects have responded to a
particular item at a particular point in time or because the item is
more closely matched to the average of the ability in the
population of subjects. Moreover, linking is based on information
from threshold estimates only, ignoring potentially useful
information from item slope estimates.

A more sophisticated linking procedure which takes both of
these factors into account is described in Appendix B. The
procedure is designed to link any number of calibrations, as long
as the data from all calibrations are linked by patterns of common
items. It is not necessary for any item to appear on all
calibrations, but each calibration must share at least two items
with other calibrations, and each calibration must be at least
indirectly linked with all other calibrations. (Calibration a is
directly linked with Calibration b if they have an item in common.
Calibration a is indirectly linked with Calibration z if there is
a sequence of directly linked calibrations beginning with a and
ending with z.)
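
The two linking conditions can be checked mechanically before any estimation is attempted. The sketch below is illustrative and is not the procedure of Appendix B. It reads "share at least two items with other calibrations" as meaning that at least two of a calibration's items also appear somewhere else, which is an assumption about the intended meaning, and it tests indirect linkage by a breadth-first search over direct links. The calibration labels and item numbers are hypothetical.

```python
from collections import deque


def linking_problems(calibrations):
    """List violations of the linking conditions; an empty list means none.

    calibrations maps a calibration label -> set of item identifiers.
    """
    problems = []
    labels = list(calibrations)
    if not labels:
        return problems
    # Condition 1: at least two of each calibration's items appear elsewhere.
    for label in labels:
        others = set().union(*(calibrations[o] for o in labels if o != label))
        if len(calibrations[label] & others) < 2:
            problems.append(f"{label} shares fewer than two items with the others")
    # Condition 2: every calibration is reachable through direct links.
    seen = {labels[0]}
    queue = deque([labels[0]])
    while queue:
        current = queue.popleft()
        for other in labels:
            if other not in seen and calibrations[current] & calibrations[other]:
                seen.add(other)
                queue.append(other)
    for label in labels:
        if label not in seen:
            problems.append(f"{label} is not linked, even indirectly, to {labels[0]}")
    return problems


# Hypothetical item sets: a chain a-b-c-d, and a design split into two halves.
linked = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {4, 5, 6}, "d": {5, 6, 7}}
broken = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {5, 6, 7}, "d": {6, 7, 8}}
print(linking_problems(linked))
print(linking_problems(broken))
```

In the second design every calibration shares two items with a neighbor, yet the two halves have no item in common, so no common scale can be established across them.
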

CHAPTER III

EXAMPLES FROM THE NAEP MATHEMATICS ASSESSMENTS, 1972/73 AND 1977/78 

Introduction to the Examples 

The process of constructing scales affording the
implementation of the aforementioned methods began with a perusal
of the NAEP classification scheme of unreleased items. Three skill
element categories comprising sufficient numbers of items,
common to all cells yet appearing in unique booklets within each
cell of the age/year breakdown, were located. The NAEP
classifications satisfying the criteria were Understanding
Mathematical Concepts, i.e., value 4 of Cognitive Subtopics, and
Arithmetic Computation and Algebraic Manipulations, i.e., values 1
and 5, respectively, of Mathematical Skills Subtopics. Tables 5, 6,
and 7 present the NAEP identification numbers of the items in these
scales, along with their locations in the various age/year
assessment forms.

While items in the first category require the ability to
translate from one form of symbolism or language to another, those
in the other two demand the rote application of the learned methods
of arithmetic and algebra. Hence, the examples illustrate the
application of the methods to measures of rudimentary as well as
abstract levels of mathematical ability.



TABLE 5 
DISTRIBUTION OF ITEMS: 
UNDERSTANDING MATHEMATICAL CONCEPTS 



13-YEAR OLDS 17-YEAR OLDS 

1977/78 1972/73 1977/78 1972/73 

NAEP # FORM ITEM FORM ITEM FORM ITEM FORM ITEM 

TABLE 6
DISTRIBUTION OF ITEMS: 
ALGEBRAIC MANIPULATIONS 



13-YEAR OLDS 17-YEAR OLDS 

1977/78 1972/73 1977/78 1972/73 

NAEP # FORM ITEM FORM ITEM FORM ITEM FORM ITEM 

TABLE 7
DISTRIBUTION OF ITEMS:
ARITHMETIC COMPUTATION



13-YEAR OLDS 17-YEAR OLDS

1977/78 1972/73 1977/78 1972/73

NAEP # FORM ITEM FORM ITEM FORM ITEM FORM ITEM

Within the 1977/78 assessment year, completion of the scales,
that is, selection of one item per remaining booklet, was
accomplished through reference to the three NAEP classifications.
To maximize the number of possible between-cell comparisons, items
common to other cells of the age/year breakdown were granted
priority in the selection process.

Because the item classification schema of the 1972/73
assessment differed from that of the 1977/78 assessment, the
selection of items was based on an item-by-item scrutiny of the
available pool. Once again, items common to other cells were given
selection priority.

\ ' ■* 

T^e resulting scales va *Y in the total number of items as 
well as ^n the number of amon'g-cell i tern communal it ies • For, 
Example, Understanding Mathematical Concepts is defined by ,a total 
.of 17 items! Of similar content. The number of items within any^ 
one cell of the age/year breakdown ranges between 7 and 11; pairs 
of) cells share between^ 3 and 6 items. Likewise, the Arithmetic 
Computation scale is comprised of a total of 20 items, the number 
of items within any v cell falling in tha interval of 9 to 11, the 

!between-cell communalit ies ranging from 5 to 7 items. Finally, a 

I ■ ^ 

\total of 17 items define the Algebraic Manipulation scale, the 
number of items within each cell varying from 7 to 11, the number 
of shared items varying from 4 to 8. 

/ \ 
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Within each cell subject groups are defined according to a 
multi-way demographic classification. The cross-classification is
based on four variables: sex, race, size and type of community,
and region of the country.

Methodology

In order to obtain item parameter and subgroup effect
estimates on a common scale across years and age groups, the
following steps were taken in each of the three skill areas:

1. Fit the Reiser group-effects model to data from each
age/year separately.

2. Establish unit-size and location of scale with respect to
the results of 1977 13-year olds.

3. Determine optimal linear transformations of remaining
age/year results to reference scale.

4. Transform item parameter and group-effect estimates to
reference scale.

The remainder of this section amplifies these procedures.

Step 1: Fit Group-Effects Model to each Age/Year Separately

The basic data addressed by the Reiser group-effects model
are the counts of numbers of attempts and numbers of correct
responses to each item observed in each cell of the design on
persons. The classification of persons used in these examples is
based on sex (male and female), race (Hispanic, Black, and white),
region of the country (Northeast, Southeast, Central, and West),
and STOC, or size and type of community (extreme rural, low
metropolitan, small places, main big cities, urban fringe, medium
cities, and high metropolitan). The design consists of 168 cells
in all. Data from persons with missing data on any of these
variables, or not identified in one of the three main
racial/ethnic categories, were excluded from the analyses.

Numbers of attempts and correct responses to each item in a
skill area were accumulated for each cell in the design, with each
person's data weighted in proportion to his NAEP sampling weight.
Weights were rescaled so that the sum of weights was equal to the
number of observations; in this way oversampling was taken into
account but numbers of observations were not exaggerated.
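
The weighting step can be illustrated with a minimal sketch. It assumes one record per attempted response, holding a hypothetical cell label, item label, 0/1 score, and sampling weight; the function names and the record layout are illustrative only and are not taken from the NAEP tapes or from Reiser's program.

```python
from collections import defaultdict


def rescale_weights(weights):
    """Rescale weights so that they sum to the number of observations."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


def accumulate(records):
    """Accumulate weighted attempts and weighted correct responses.

    records: list of (cell, item, correct, weight) tuples (assumed layout).
    Returns {(cell, item): [weighted_attempts, weighted_correct]}.
    """
    scaled = rescale_weights([rec[3] for rec in records])
    counts = defaultdict(lambda: [0.0, 0.0])
    for (cell, item, correct, _), w in zip(records, scaled):
        counts[(cell, item)][0] += w            # weighted attempts
        counts[(cell, item)][1] += w * correct  # weighted correct responses
    return dict(counts)
```

Because the rescaled weights sum to the number of records, the accumulated "attempts" retain the size of the actual sample while still reflecting unequal sampling rates.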

In its attempt to explain the observed (weighted) proportions
of correct response to each item from each cell in the design on
persons, Reiser's model yields estimates for threshold and slope
parameters for each item (reflecting items' relative difficulties
and reliabilities) and for contrasts among selected cells in the
design on persons. A maximum of 168 contrasts could be estimated
with the present design, including all main effects and all
possible interactions. Because Reiser's dissertation research
suggested that interactions were generally negligible, only main
effects were included here. Simple contrasts were employed for
sex, race, and region:

Male - Female

Hispanic - white
Black - white

Northeast - West
Southeast - West
Central - West

So-called identity contrasts were employed for STOC.
Conditional on the effects listed above, the average scale-score
in each STOC category is estimated. The location and unit-size,
which must be arbitrarily specified, were provisionally set by
fixing the "extreme rural" effect at -1.00 and the "high metro"
effect at +1.00.

The parameter estimates obtained in a given run of the Reiser
model, then, consist of thresholds and slopes for the items
presented in that age/year, one sex effect, two race effects,
three region effects, and seven STOC effects. Each estimate is
accompanied by a large-sample standard error of estimation, except
for the two STOC effects that were fixed to set the scale.
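
The size of the design and the parameter count just described can be tallied in a short sketch. The dictionary and function names are illustrative, and treating the two fixed STOC effects as not estimated is an assumption about how the count of parameters is taken.

```python
from itertools import product

# Levels of the four classification variables named in the text.
FACTORS = {
    "sex": ["male", "female"],
    "race": ["Hispanic", "Black", "white"],
    "region": ["Northeast", "Southeast", "Central", "West"],
    "stoc": ["extreme rural", "low metro", "small places", "main big city",
             "urban fringe", "medium city", "high metro"],
}


def design_cells():
    """Every combination of levels: the cells of the design on persons."""
    return list(product(*FACTORS.values()))


def n_group_effects():
    """Simple contrasts (levels - 1) for sex, race, region; identity for STOC."""
    simple = sum(len(FACTORS[name]) - 1 for name in ("sex", "race", "region"))
    identity = len(FACTORS["stoc"])
    return simple + identity


def n_free_parameters(n_items, n_fixed=2):
    """Threshold and slope per item, plus group effects, minus fixed effects."""
    return 2 * n_items + n_group_effects() - n_fixed
```

With these definitions, `len(design_cells())` gives the 168 cells cited above and `n_group_effects()` gives the thirteen main-effect terms (one sex, two race, three region, seven STOC).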

Item parameter and subject-group effects can be combined to
produce estimated proportions of correct response to each item in
each cell. Tests of fit are obtained by comparing these estimated
proportions with the observed proportions. Likelihood ratio Chi
squares have been provided for each run, with numbers of degrees
of freedom equal to the numbers of non-empty cells times the
numbers of items presented in the age/year in question, minus the
number of parameters estimated in the run. Because likelihood
ratio Chi squares can be questionable for small cells (and some
cells in the design, such as high metro female Hispanics in the
Northeast, are very small), the more robust Freeman-Tukey Chi
squares are also provided for selected runs for comparison.
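
For orientation, the two fit indices can be sketched in their usual textbook forms for binomial data. Whether Reiser's program computed them in exactly this way is an assumption, and the function names are illustrative.

```python
import math


def lr_chi_square(cells):
    """Likelihood ratio chi-square over (attempts, correct, expected_p) triples."""
    g2 = 0.0
    for n, r, p in cells:
        # Each cell contributes a "correct" and an "incorrect" count.
        for observed, expected in ((r, n * p), (n - r, n * (1.0 - p))):
            if observed > 0:
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2


def freeman_tukey_chi_square(cells):
    """Freeman-Tukey chi-square: sum of squared Freeman-Tukey deviates."""
    ft = 0.0
    for n, r, p in cells:
        for observed, expected in ((r, n * p), (n - r, n * (1.0 - p))):
            deviate = (math.sqrt(observed) + math.sqrt(observed + 1.0)
                       - math.sqrt(4.0 * expected + 1.0))
            ft += deviate ** 2
    return ft


def degrees_of_freedom(nonempty_cells, n_items, n_parameters):
    """Non-empty cells times items presented, minus parameters estimated."""
    return nonempty_cells * n_items - n_parameters
```

The Freeman-Tukey form works with square roots of counts, which is why it behaves more stably than the logarithmic likelihood ratio form when a cell holds only a handful of observations.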

\ 

Step 2: Establish Reference Scale in 1977 13-Year Old Results

The size of units and the zero point of the scale must be
arbitrarily fixed in the Reiser group-effects model. The scale
for these examples has been set by requiring the estimated grand
mean of 1977 13-year old results to be zero and the distance
between the "extreme rural" and "high metro" STOC categories to be
two.

As noted above, the provisional scales for each age/year run
were set by requiring the values for these two STOC categories to
be -1.00 and +1.00 respectively, so the unit-size in the 1977
13-year old provisional scale meets specification. The grand mean
over all 1977 13-year olds was determined by averaging STOC
effects, each weighted by the proportion of the population it
represented. This grand mean was subtracted from all 1977 13-year
old STOC effects and item thresholds so as to fix the grand mean
at zero. This scaling is the reference to which the remaining
age/year results will be transformed.

Step 3: Determine Linear Transformations for Remaining Age/Years

Under the assumption that the items in a scale define the
same variable across ages and over years, the sets of item
threshold estimates for items presented in two age/years will
differ by only a linear transformation, aside from random errors
of estimation. Similarly, the two sets of item slope estimates
will differ non-randomly by a scaling constant only, namely the
scaling constant required in the linear transformation of the item
threshold estimates. Once the linear transformation has been
determined, item parameters and group effects may be put onto a
common scale.

The weighted least-squares algorithm described in Appendix B
has been used to obtain optimal estimates of the linear
transformations required to bring the results from the remaining
age/years to the reference scale established for the 1977 13-year
olds. Information is utilized from all occurrences of an item in
two or more age/years, including the precision with which each
estimate is determined. The goal of the algorithm may be described
as minimizing the squared weighted differences among item
parameters estimated in two or more age/years.
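
The idea can be illustrated with a deliberately simplified sketch for a single pair of age/years. It is not the Appendix B algorithm, which pools all four age/years at once: here the thresholds of the common items in one age/year are regressed on those of the reference age/year, and the weight given to each item (the inverse of the sum of its two squared standard errors) is an assumption.

```python
def fit_linear_link(x, se_x, y, se_y):
    """Weighted least-squares fit of y ~ m * x + b over common items.

    x, se_x: threshold estimates and standard errors in the age/year to be
             rescaled; y, se_y: the same items on the reference scale.
    """
    weights = [1.0 / (sx ** 2 + sy ** 2) for sx, sy in zip(se_x, se_y)]
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
               for w, xi, yi in zip(weights, x, y))
    m = s_xy / s_xx          # slope (unit-size) of the transformation
    b = y_bar - m * x_bar    # intercept (location) of the transformation
    return m, b
```

Items estimated with large standard errors in either age/year receive little weight, so the fitted line is governed mainly by the well-determined common items.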

It has been determined that the 1977 13-year old results are
the reference scale, so the identity transformation is known to be
appropriate for that age/year. Estimation error variation of the
rescaling constants has been apportioned across all four
age/years, however, to reflect uncertainty in all age/years in the
transformation of group-effect estimates. Table 8 displays the
estimates and standard errors of estimation used in the examples.

TABLE 8
RESCALING PARAMETERS

AGE/YEAR              SLOPE     SE     INTERCEPT     SE

UNDERSTANDING MATHEMATICAL CONCEPTS

1977 13-YEAR OLDS      000     035        473       .118
1972 13-YEAR OLDS      686     026        885       .099
1977 17-YEAR OLDS      187     045        801       .127
1972 17-YEAR OLDS      744     031        957       .125

ALGEBRAIC MANIPULATION

1977 13-YEAR OLDS     .000     022        432       .067
1972 13-YEAR OLDS     .329     014        609       .071
1977 17-YEAR OLDS     .824     021        254       .066
1972 17-YEAR OLDS     .849     021        612       .065

ARITHMETIC COMPUTATION

1977 13-YEAR OLDS      000     021        614       .078
1972 13-YEAR OLDS      762     021        100       .078
1977 17-YEAR OLDS      577     018        888       .068
1972 17-YEAR OLDS      675     022        443       .087
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Step 4: Transform Results to Reference Scale 

Let f(x) = mx + b be the estimated linear transformation of the
results for a given age/year to the reference scale. The
transformation of STOC effects and the grand average, reflecting
locations along the scale, is accomplished as follows:

$$\theta^* = m\theta + b$$

$$SE(\theta^*) = \sqrt{m^2\,SE^2(\theta) + \theta^2\,SE^2(m) + SE^2(b)}$$

(The adjustment of the standard error neglects a term attributable
to the covariance of the errors of estimation of m and b, as these
terms have been found to be negligible.) The transformations of
sex, race, and region effects, which represent distances along the
scale, are accomplished by:

$$\theta^* = m\theta$$

$$SE(\theta^*) = \sqrt{m^2\,SE^2(\theta) + \theta^2\,SE^2(m)}$$
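
The two rules translate directly into code. The sketch below is illustrative only; the function and argument names are not from the report, and the covariance term is omitted, as in the formulas.

```python
import math


def transform_location(theta, se_theta, m, se_m, b, se_b):
    """Rescale a location (STOC effect, grand average, or item threshold)."""
    theta_star = m * theta + b
    se_star = math.sqrt(m ** 2 * se_theta ** 2
                        + theta ** 2 * se_m ** 2
                        + se_b ** 2)
    return theta_star, se_star


def transform_distance(theta, se_theta, m, se_m):
    """Rescale a distance (sex, race, or region contrast): no intercept."""
    theta_star = m * theta
    se_star = math.sqrt(m ** 2 * se_theta ** 2 + theta ** 2 * se_m ** 2)
    return theta_star, se_star
```

A contrast is a difference between two locations, so the intercept b cancels and contributes neither to the rescaled value nor to its standard error.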

Final estimates of item parameters were obtained by first 
transforming the threshold estimates in each age/year in the same 
manner as STOC effects and slope estimates in the same manner as 
contrast effects, and then obtaining weighted threshold and slope 
averages for each item over all ages and years in which it was 
administered. 
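
The weights used for these averages are not stated here. One common choice, assumed in the sketch below, is to weight each estimate by the inverse of its squared standard error. As a check, this choice appears to reproduce the grand averages reported for item 5-H12025 in Table 10 when the age/year entries are read as thresholds of 1.37 and 1.41 (standard errors .20 and .30) and slopes of .19 and .21 (standard errors .03 and .03).

```python
import math


def precision_weighted_average(estimates, std_errors):
    """Inverse-variance weighted mean and its standard error (assumed rule)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)


# Item 5-H12025, Understanding Mathematical Concepts (values as read from Table 10).
threshold, threshold_se = precision_weighted_average([1.37, 1.41], [0.20, 0.30])
slope, slope_se = precision_weighted_average([0.19, 0.21], [0.03, 0.03])
```

Rounded to two places, the results agree with the grand-average entries for that item (threshold 1.38 with standard error .17, slope .20 with standard error .02).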

Taken together, the final estimates of item parameters and
group effects can be used to compute expected proportions of
correct response to any item in the scale from any cell in the
design on persons. To facilitate the interpretation of the
effects, additional tables of conditional margins have been
provided; that is, estimated averages for each of the levels in
the sex, race, and region factors, under the assumption of "all
other factors held constant." The average of the conditional
effects over all the levels of a given factor in a given age/year,
with each level weighted in accordance with the proportion of the
population it represents, is the grand mean for that year. The
marginal proportions of the factors used in these computations are
given in Table 9.
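
As a worked example, the grand mean for 17-year-olds in 1977 on Understanding Mathematical Concepts can be recovered from the STOC proportions in Table 9 and the STOC conditional margins in Table 12. The variable names are illustrative.

```python
# STOC category: (sampled proportion from Table 9, conditional margin from Table 12),
# both for age 17 in 1977, Understanding Mathematical Concepts.
STOC_1977_AGE_17 = {
    "extreme rural": (0.100, 1.61),
    "low metro":     (0.099, 0.83),
    "small places":  (0.349, 1.71),
    "urban fringe":  (0.157, 2.22),
    "main big city": (0.136, 2.07),
    "medium city":   (0.058, 2.73),
    "high metro":    (0.101, 3.99),
}


def weighted_grand_mean(levels):
    """Average of conditional margins weighted by population proportions."""
    total_weight = sum(p for p, _ in levels.values())
    return sum(p * margin for p, margin in levels.values()) / total_weight


grand_mean = weighted_grand_mean(STOC_1977_AGE_17)
```

The result rounds to 2.03, which is the grand mean listed for that age/year in Table 12.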

Results

The results of the procedures outlined above are summarized
in Tables 10 through 18 and Figures 3 through 5. Tables 10
through 12 and Figure 3 concern Understanding Mathematical
Concepts: Table 10 presents rescaled item parameter estimates from
all four age/years and grand averages, Table 11 presents the
corresponding estimates of group effects, Table 12 presents the
conditional margins they imply, and Figure 3 plots item thresholds
and race/ethnicity averages against the ability scale. Similar
information for Algebraic Manipulations is presented in Tables 13
through 15 and Figure 4, and for Arithmetic Computation in Tables
16 through 18 and Figure 5. Highlights are discussed below.

Overall indices of goodness-of-fit of the group-effects model
to data from each age/year for Concepts, Manipulation, and Com-
putation are found in Tables 11, 14, and 17 respectively. Chi- 



TABLE 9
SAMPLED MARGINAL PROPORTIONS

SUBGROUP        1977, AGE 13  1972, AGE 13  1977, AGE 17  1972, AGE 17

MALE                .499          .505          .487          .487
FEMALE              .501          .495          .513          .513
HISPANIC            .060          .056
BLACK               .164          .167
WHITE               .776          .777          .816          .805
NORTHEAST           .227          .248          .232          .244
SOUTHEAST           .226          .256          .229          .254
CENTRAL             .316          .248          .327          .253
WEST                .231          .247          .213          .249
EXTREME RURAL       .099          .101          .100          .101
LOW METRO           .101          .101          .099          .098
SMALL PLACES        .332          .333          .349          .364
URBAN FRINGE        .154          .106          .157          .084
MAIN BIG CITY       .141          .120          .136          .115
MEDIUM CITY         .071          .140          .058          .139
HIGH METRO          .102          .099          .101          .099

NOTES: 1. DATA FROM APPROXIMATELY 24,000 PERSONS IS ANALYZED
          IN EACH AGE/YEAR.
       2. PROPORTIONS SHOWN ABOVE INCORPORATE NAEP CASE WEIGHTS.

TABLE 10 

/ 

ITEM PARAMETER ESTIMATES: 
MATHEMATICS CONCEPTS 



1977, AGE 13 1972, AGE 13 1977, AGE 17 1972, AGE 17 GRAND AVERAGES 



ITEM THRESH SE SLOPE SE THRESH SE SLOPE SE THRESH SE SLOPE SE THRESH SE SLOPE SE THRESH SE SLOPE SE 
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0 


23 


0 


22 


0 


15 


0 


02 


5-G43009 


1 


92 


0 


28 


0 


14 


0 


02 


















































1 


92 


0 


28 


0 


14 


0 


02 


5-H12025 


















1 


37 


0 


20 


0 


19 


0 


03 


















1 


41 


0 


30 


0 


21 


0 


03 


1 


38 


0 


.17 


0 


20 


0 


02 


5-G20001 


















































4 


62 


0 


25 


0 


25 


0 


04 


4 


,62 


0 


25- 


_o 


-25 


-0" 


04 


5-K5 1020 


















0 


78 


0 


14 


0 


31 


0 


04 


































-■Or 


"78 


d 


14 


0 


31 


0 


04 


5-A2 1032 


















































4 


10 


0 


24 


0 


20 


On 


03 


4 


10 


0 


24 


0 


20 


0 


03 


5-B2201 1 
























* * 



























































ITEM DELETED; QUESTIONABLE DATA ON NAEP PUBLIC RELEASE TAPE.
ITEM DELETED; CONVERGENCE PROBLEMS IN GROUP-EFFECTS PROGRAM.
ITEM DELETED; ESTIMATED THRESHOLD VALUE TOO EXTREME.



TABLE 11 
ESTIMATES OF GROUP EFFECTS: 
UNDERSTANDING MATHEMATICAL CONCEPTS 



EFFECT 


AGE 13, 


1977 


AGE 13, 


1972 


AGE 


17, 


1977 


AGE 


17,. 


1972 


GRAND MEAN 


. 00 ( 


.14) 




36 1 


.12 ) 


1 . 


87 


( . 15 ; 


Z . 




• 13/ 


MALE-FEMALE 


- . 07 ( 


. 09 ) 




71 I 


. 14 ) 


• 


32 


( . 10 ; 


• 


bb — ~\ 


.1 £■ ) 


H ISP-WHITE 


-2.39 i 


.351 




,76 ( 


. 34 ) 


-2 . 


20 


( . 3 2h 


. ~z . 


JO \ 




black-white: 


-2.93 1 


, .37) 


-2 


.55 1 


.42 ) 


-2 . 


92 


( . 34 ; 


-2 . 


3D A 


. 4b ) 


NE-WEST 


.66 i 


[ .16) 




.00 1 


.13) 


• 


67 


(.17) 


• 


17 ( 


.13) 


SE-WEST 


-.16 1 


: .16) 




.27 1 


(.14) 


• 


13 


( .15) 


• 


18 ( 


.14) 


CENTRAL-WEST 


.67 1 


:.i4) 




.26 1 


; .14) 


• 


72 


( .15) 


• 


34 ( 


.13) 


EXTREME RURAL 


-.53 


(.12) 




.20 


( .10) 


i. 


61 


(.14) 


2. 


21 ( 


.13) 


LOW METRO 


'-.71 


( .27) 




.14 


( .26) 


• 


83 


( .30) 


1. 


57 ( 


.29) 


SMALL PLACES 


-.25 


(.19) 




.54 


( .16) 


i. 


71 


(.21) 


2. 


16 ( 


: .20) 


MAIN BIG CITY 


.03 


(.21) 




.52 


( .20) 


2. 


07 


( .22) 


2. 


50 ( 


: .23L- 


URBAN FRINGE 


.00 


( .21) 




.67 


(.19) 


2. 


22 


( .22) 


2. 


-42 — 1 


r.22) 


MEDIUM CITY 


.69 


( .23) 




.50 


(.19) 


2. 


73 


( .27) 


2. 


58 1 


: .20) 


HIGH METRO 


1.47 


( .13) 


— 1- 


~.5T' 


( .10) 


3. 


99 


(.14) 


3. 


70 1 


: .13) 



CHI SQUARE (LR)    1697.69    1367.79    1628.06    1518.32
CHI SQUARE (FT)         NA         NA    1478.31    1082.20
DEGREES FREEDOM    1178.00     954.00    1147.00     808.00



TABLE 12
ESTIMATED CONDITIONAL MARGINS:
UNDERSTANDING MATHEMATICAL CONCEPTS

SUBGROUP        AGE 13, 1977  AGE 13, 1972  AGE 17, 1977  AGE 17, 1972

GRAND MEAN           .00           .55          2.03
MALE                -.04           .91          2.19          2.66
FEMALE               .04           .19          1.87          2.10
HISPANIC           -1.77          -.68           .33           .57
BLACK              -2.31         -1.47          -.38          -.05
WHITE                .62          1.07          2.53          2.34
NORTHEAST            .33           .55          2.34          2.47
SOUTHEAST           -.49           .28          1.54          2.12
CENTRAL              .34           .81          2.39          2.64
WEST                -.33           .55          1.67          2.30
EXTREME RURAL       -.53           .20          1.61          2.21
LOW METRO           -.71          -.14           .83          1.57
SMALL PLACES        -.25           .54          1.71          2.16
MAIN BIG CITY        .03           .52          2.07          2.50
URBAN FRINGE         .00           .67          2.22          2.42
MEDIUM CITY          .69           .50          2.73          2.58
HIGH METRO          1.47          1.57          3.99          3.70

[Plot: vertical ability scale marked 6.0, 4.0, 2.0, 0.0, and -2.0.
Age-race/ethnicity conditional margins (17-W, 13-W, 17-H, 17-B,
13-H, 13-B) are plotted in separate columns for 1972 and 1977;
items are plotted at their thresholds.]

ITEM       ABBREVIATED TEXT

5-G20001   $Y / (4 BOYS) = ?
5-A21032   IF N IS ODD, N+1 IS EVEN
5-K10010   ... XY = 1/2 x 4 INCHES, OR 2 INCHES
5-B32632   IF A*B = (AxB)-B, 4*5 = (4x5)-5 OR 15
5-B31732   ANY NUMBER TIMES ONE IS THAT NUMBER
5-G43009   TEMPLATE FOR ASSOCIATIVE PRINCIPLE HOLDS FOR BOTH + AND X
5-H12025   IF X<4, X + 7<11
5-A45532   NEGATIVE NUMBER DIVIDED BY POSITIVE NUMBER IS NEGATIVE
5-K30004   LINE SEGMENT HM TWICE AS LONG AS NP
5-A21022   EVEN NUMBER + 2 IS EVEN *
5-K51020   DISTANCE BETWEEN CENTERS
5-B33232   IF Z<6 AND Y<Z THEN Y<6 *
5-B11008   A>5 & B>5 INSUFFICIENT INFO. FOR RELATION OF A AND B
5-N00002   IF HENRY>BILL AND BILL>PETE, THEN HENRY>PETE

* ITEM TEXT SLIGHTLY REVISED IN ORDER TO MAINTAIN SECURITY.

FIGURE 3

ITEM THRESHOLDS AND RACE/ETHNICITY CONDITIONAL MARGINS:
UNDERSTANDING MATHEMATICAL CONCEPTS

TABLE 13 



ITEM PARAMETER ESTIMATES: 
ALGEBRAIC MANIPULATIONS 



1977, AGE 13 



1972 t AGE 13 



1977, AGE 17 



1972, AGE 17 



GRAND AVERAGES 



ITEM 


THRESH 


SE 


SLOPE 


SE 


TH 


5-H1 1025 


0 


08 


0. 


15 


0 


25 


0 


03 




5-G10003 


5 


46 


0 


67 


0 


29 


0, 


04 


4 


5-H1 1007 


0 


53 


0 


12 


0 


34 


0 


04 


0 


5-G50022 


6 


12 


0 


99 


0 


15 


0 


03 


3 


5-G4 3005 




















5-H1 1015 




















5-G44007 




















5-131001 




















5-H1 1010 


















2 


5-H1 1002 


2 


17 


0 


19 


0 


35 


0 


04 


2 


5-H1 1026 


















-2 


5-B40225 


-0 


77 


0 


20 


0 


30 


0 


03 




5-B304-25 


3 


99 


0 


45 


0 


22 


0 


03 




5-H21001 




















5-B20925 




















5-B21325 




















5-B20125 





















THRESH SE SLOPE SE THRESH SE 



SLOPE 



SE THRESH SE SLOPE SE THRESH SE SLOPE SE 



72 1.00 
52 ,0. 1 1 
93 0.81 



11 O. 36 
96 0.56 
0.69 



16 



0.32 0.08 
0.42 O. 10 
0.26 0.07 



0.31 0.07 
0.29 0.07 
O. 32* 0.08 



0 


01 


9< 


31 


6. 


31 


0 


04 


0 


30 


0 


32^ 


.0 


30 


0 


04 


0 


10 


0 


12 


0. 


29 


0 


02 








* 










2 


92 


0 


16 


0 


20 


0 


03 


3 


10 


0 


16 


0. 


26 


0 


02 


0 


90 


0 


24 


0. 


29 


0 


04 


0 


86 


0 


23 


0 


36 


0 


04 


0 


59 


0 


07 


0. 


34 


0 


02 


2 


98 


0. 


15 


0. 


23 


0 


03 


3 


36 


0 


15 


0 


25 


0 


03 


3 


21 


0 


1 1 


0. 


23 


0 


02 


5 


22 


0 


35 


0. 


32 


0 


04 


4 


84 


0 


25 


0 


34 


0 


04 


4 


96 


0 


20 


0 


33 


0 


03 


5 


69 


0 


39 


0. 


25 


0 


03 


4 


81 


0 


28 


0 


24 


0 


03 


5 


1 1 


0 


23* 


0 


24 


0 


02 


6 


18 


0 


49 


0. 


27 


0 


04 


6 


62 


0 


50 


0 


25 


0 


04 


6 


39 


0 


35 


0 


26 


0 


03 


6 


63 


0 


56 


0 


72 


0 


13 * 


8 


89 


1 


26 


0 


42 


0 


10 


7 


00 


0 


51 


0 


65 


0 


09 


















0 


68 


0 


28 


0 


25 


0 


03 


1 


23 


0 


22 


0 


27 


0 


03 


































2 


25 


0 


18 


0 


34 


0 


03 


































-2 


16 


0 


69 


0 


32 


0 


08 


































-0 


77 


0 


20 


0 


30 


0 


03 


































3 


99 


0 


45 


0 


22 


0 


03 


















5 


75 


0 


36 


0 


31 


0 


04 


5 


75 


0 


36 


0 


31 


0 


Q4 


8 


24 


0 


96 


0* 


37 


0 


.07 


















8 


24 


0 


96 


0 


37 


0 


07 



ITEM DELETED; QUESTIONABLE DATA ON NAEP PUBLIC RELEASE TAPE. 
ITEM DELETED; CONVERGENCE PROBLEMS IN GROUP-EFFECTS PROGRAM, 






TABLE 14
ESTIMATES OF GROUP EFFECTS:
ALGEBRAIC MANIPULATION

EFFECT           AGE 13, 1977

GRAND MEAN         .00 (.09)
MALE-FEMALE       -.26 (.08)
HISP-WHITE       -1.81 (.24)
BLACK-WHITE      -1.84 (.19)
NE-WEST            .33 (.12)
SE-WEST           -.34 (.13)
CENTRAL-WEST       .25 (.11)
EXTREME RURAL     -.57 (.07)
LOW METRO         -.91 (.25)
SMALL PLACES      -.29 (.14)
MAIN BIG CITY     -.08 (.16)
URBAN FRINGE       .67 (.15)
MEDIUM CITY        .24 (.17)
HIGH METRO        1.43 (.07)



AGE 13, i972 AGE 17, 1977 AGE 17, 1972 



•1.29 
■1.88 




CHI SQUARE (LR) 4504.31 
CHI SQUARE (FT) 1257.52 

DEGREES FREEDOM 814.00 



2146.00 
1275.63 

814. 0"0 



2857.26 
980.67 

'921.00 



1527.28 
1349.41 

U61. 00 






TABLE 15
ESTIMATED CONDITIONAL MARGINS:
ALGEBRAIC MANIPULATION

SUBGROUP        AGE 13, 1977  AGE 13, 1972  AGE 17, 1977  AGE 17, 1972

GRAND MEAN           .00           .36          1.87          2.29
MALE                -.13           .19          2.00          2.42
FEMALE               .13           .53          1.76          2.16
HISPANIC           -1.40          -.54           .12
BLACK              -1.43         -1.14           .11
WHITE                .41           .75                        2.77
NORTHEAST            .25           .94          2.49          2.80
SOUTHEAST           -.42           .10          1.41          2.07
CENTRAL              .17           .61          1.93          2.17
WEST                -.08          -.20          1.61          2.15
EXTREME RURAL       -.57           .28          1.43          1.77
LOW METRO           -.90          -.58           .77          1.75
SMALL PLACES        -.29          -.13          1.70          2.21
MAIN BIG CITY       -.08          -.22          2.12          2.38
URBAN FRINGE         .67          -.30          1.95          2.20
MEDIUM CITY          .24           .47          2.60          2.49
HIGH METRO          1.43           .94          3.08          3.47

\ 



[Plot: vertical ability scale marked 8.0, 6.0, 4.0, 2.0, 0.0, and
-2.0. Age-race/ethnicity conditional margins (17-W, 17-H, 13-W,
17-B, 13-H, 13-B) are plotted in separate columns for 1972 and
1977; items are plotted at their thresholds.]

ITEM       ABBREVIATED TEXT

5-B20925   IF N=3K AND N+K=72, THEN K=18 AND N=54
5-131001   POINTS (X,Y) ON CIRCLE SATISFY X SQ + Y SQ = 36
5-G44007   FACTORS OF X SQUARE - 5X + 6 ARE (X-2) AND (X-3)
5-H21001   FIND SOLUTION SET OF (X-1)(X+7)=0
5-H11015   IF 3X + 6 - 14 = X + 2 THEN X=5
5-G43005   (2X-1)(X+3) = 2X SQUARE + 5X - 3 *
5-B30425   3X + 5Y + 4X = 7X + 5Y *
5-C50022   IF A/B = C/D, THEN AxD = BxC IS TRUE
5-G10003   1/3 x A/2 = A/6
5-H11002   5 IN BOX MAKES 3(BOX +6)=21 TRUE *
5-H11010   IF 3X-3 = 12 THEN X=?
5-H11007   IF 2/3 = X/15 THEN X = 10 *
5-H11025   IF X + 2 > 7, X MUST BE > 5 *
5-B40225   THE VALUE OF X+6 WHEN X = 3 IS 9
5-H11026   IF X-3 = 7, THEN X =

* ITEM TEXT SLIGHTLY REVISED IN ORDER TO MAINTAIN SECURITY.

FIGURE 4

ITEM THRESHOLDS AND RACE/ETHNICITY CONDITIONAL MARGINS:
ALGEBRAIC MANIPULATIONS

TABLE 16 

ITEM PARAMETER ESTIMATES: 
ARITHMETIC COMPUTATION 



1977, AGE 13 1972, AGE 13 1977, AGE 17 1972, AGE 17 GRAND AVERAGES 



TTCU 

I i c rn 


THRESH 


S E 


SLOPE 


SE 


THRESH 


SE 


SLOPE 


SE 


THRESH 


SE 


SLOPE 


SE 


THRESH 


SE 


SLOPE 


SE 


THRESH 


SE 


SLOPE 


SE 


5-C20006 


1 


40 


0 


16 


0. 


20 


0 


02 


2 


04 


0 


26 


0. 


28 


0 


04 


1 


73 


0 


20 


0. 


38 


0 


05 


1 


1 1 


0 


35 


0. 


30 


0 


04 


1 


58 


0 


1 1 


0 


29 


0 


02 


5-FOC006 


3 


12 


0 


29 


0 


24 


0 


03 


2 


74 


0 


35 


0. 


22 


0 


03 


2 


81 


0 


14 


0. 


26 


0 


04 








* 










2 


86 


0. 


12 


0 


24 


0 


02 


5-C 10049 


-0 


86 


0 


23 


0 


25 


0 


03 


-0 


51 


0 


18 


0 


24 


0 


03 


-0 


85 


0 


68 


0 


24 


0 


04 


-0 


38 


0 


61 


0. 


26 


0 


04 


-0 


63 


0. 


13 


0 


25 


0 


02 


5-C30010 


1 


28 


0 


15 


0 


22 


0 


03 


1 


59 


0 


22 


0 


22 


0 


03 


0 


69 


0 


42 


0 


21 


0 


04 


1 


31 


0 


35 


0 


26 


0 


04 


1 


32 


0. 


1 1 


0 


23 


0 


02 


5-A23009 


3 


41 


0 


34 


0 


20 


0 


03 


2 


09 


0 


28 


0 


22 


0 


03 


3 


14 


0 


16 


0 


20 


0 


03 


3 


22 


0 


16 


0 


23 


0 


03 f 




06 


0 


10 


0 


21 


0 


02 


5-A22010 


5 


42 


0 


58 


0 


26 


0 


04 


4 


79 


0 


66 


0 


25 


0 


04 


































5 


1 1 


0 


42 


0 


26 


0 


03 


5-C 10009 


-4 


75 


0 


90 


0 


17 


0 


03 


-6 


63 


1 


29 


0 


12 


0 


02 


































-5 


37 


0 


74 


0 


15 


0 


02 


5- F 30006 


































6 


14 


0 


45 


0 


30 


0 


04 








* 










6 


14 


0 


45 


0 


30 


0 


04 


5-A45232 


3 


49 


0 


34 


0 


20 


0 


03 


















































3 


49 


0 


34 


0 


20 


0 


03 


5-A34632 


5 


.60 


0 


70 


0 


16 


0 


02 


















































5 


60 


0 


70 


0 


16 


0 


02 


5-A1 1832 


-1 


60 


0 


28 


0 


29 


0 


03 


















































-1 


60 


0 


28 


0 


29 


0 


03 


5-B31225 


































3 


95 


0 


18 


0 


32 


0 


05 


















3 


95 


0 


18 


0 


32 


0 


05 


5-B 13002 


































0 


71 


0 


33 


0 


40 


0 


.06 


















0 


7 1 


0 


33 


0 


40 


0 


06 


5-A31732 


































4 


21 


0 


22 


0 


26 


0 


04 


















4 


21 


0 


22 


0 


26 


0 


04 


5-F00007 


















































6 


35 


0 


41 


0 


27 


0 


04 


6 


35 


0 


41 


0 


27 


0 


04 


5-C20021 


















2 


.08 


0 


26 


0 


26 


0 


.03 


















1 


37 


0 


37 


0 


20 


0 


03 


1 


.84 


0 


21 


0 


24 


0 


02 


5-F00A03 
5-C20022 


















































1 


08 


1 


.09 


0 


19 


0 


03 


1 


08 


1 


09 


0 


19 


0 


03 


















































-o 


70 


0 


91 


0 


13 


0 


.03 


-0 


70 


0 


91 


0 


13 


0 


03 


5-C1001 1 


















-5 


.37 


0 


94 


0 


15 


0 


.02 
























* * 








-5 


.37 


0 


94 


0 


15 


0 


02 


5-C30001 















































































* ITEM DELETED; QUESTIONABLE DATA ON NAEP PUBLIC RELEASE TAPE. 
*+ ITEM DELETED; CONVERGENCE PROBLEMS IN GROUP-EFFECTS PROGRAM. 



'i 




TABLE 17 
ESTIMATES OF GROUP EFFECTS: 
ARITHMETIC COMPUTATION 



J 



EFFECT 


AGE 13, 


1977 


AGE 13, 


1972 


AGE 17, 


1977 


AGE 


17, 


1972 


GRAND MEAN 




. 00 


1.17) 




.19 




2 


.43 


I 19) 


o 


u ^ 


( 15 ) 


MALE-FEMALE 




.13 


f f\o \ 

i • 08 ; 


_ 


.20 


( no \ 




.33 


\ . UO 1 


• 




( Oft ^ 


HISP-WHITE 


-2 


.51 


1.30) 


-2 


.17 


\ . 3 Z ) 


-1 


.68 


1 01 } 
\ • £■ 1 I 




d y 




BLACK-WHITE 


-2 


.61 


( . 25 ) 


-3 


.38 


( . 42 ; 


-2 


.15 


1 OQ ^ 

\ . I 


_ o 


D Z 


( ) 


NE-WEST 




.33 


( .13) 


1 


.13 


(.19) 




.68 


(.14) 




62 


( .13) 


SE-WEST 




.92 


( .16) 




.03 


(.12) 




.05 


(.12) 




20 


( .10) 


CENTRAL-WEST 




.00 


(.12) 




.82 


( .15) 




.34 


( .10) 




50 


(.11) 


EXTREME RURAL 




.39 


( .08) 




.66 


( .08) 


2 


.31 


( .07) 


2. 


77 


( .09) 


LOW METRO • 




.60 


( .25) 


-1 


.47 


( .27) 


1 


.65 


( .23) 


2. 


31 


( .22) 


SMALL PLACES 




.25 


( .16) 




.36 


( .15) 


2 


.20 


(.14) 


2. 


93 


(.14) 


MAIN BIG CITY 




.44 


(.19) 




.09 


(.17) 


2 


.54 


(.14) 


3. 


10 


( .17) 


URBAN FRINGE 




.58 


( .16) 




.35 


( .16) 


2 


.60 


(.14) 


3. 


08 


( .16) 


MEDIUM CITY 




.02 


(.19) 




.19 


( .16) 


2 


.89 


( .16) 


3. 


10 


( .15) 


HIGH METRO 


1 


.61 


( .08) 




.86 


( .08) 


3 


.47 


( .07) 


4. 


12 


( .09) 



CHI SQUARE (LR)         NA    2082.84    3105.16    1569.53
CHI SQUARE (FT)         NA    1584.01    1268.41    1417.59
DEGREES FREEDOM         NA    1064.00     911.00    1150.00

TABLE 18
ESTIMATED CONDITIONAL MARGINS:
ARITHMETIC COMPUTATION

SUBGROUP        AGE 13, 1977  AGE 13, 1972  AGE 17, 1977  AGE 17, 1972

GRAND MEAN           .00          -.19          2.43          3.02
MALE                -.07          -.29          2.60          3.19
FEMALE               .07          -.09          2.27          2.84
HISPANIC           -1.93         -1.67          1.13          1.80
BLACK              -2.03         -2.88           .66           .88
WHITE                .58           .49          2.80          3.49
NORTHEAST            .46           .45          2.85          3.02
SOUTHEAST           -.79          -.65          2.13          2.90
CENTRAL              .14           .14          2.52          3.19
WEST                 .13          -.68          2.18          2.69
EXTREME RURAL       -.39          -.66          2.31          2.77
LOW METRO           -.60         -1.47          1.65          2.31
SMALL PLACES        -.25          -.37          2.20          2.93
MAIN BIG CITY       -.44          -.09          2.54          3.10
URBAN FRINGE         .58           .35          2.60          3.08
MEDIUM CITY          .02           .19          2.89          3.10
HIGH METRO          1.61           .86          3.47          4.12

[Plot: vertical ability scale marked 6.0, 4.0, 2.0, 0.0, -2.0, and
-5.0. Age-race/ethnicity conditional margins (17-W, 17-H, 17-B,
13-W, 13-H, 13-B) are plotted in separate columns for 1972 and
1977; items are plotted at their thresholds.]

ITEM       ABBREVIATED TEXT

5-F00007   3**0 = ?
5-F30006   6 IS THE WHOLE NUMBER NEAREST SQRT OF 38 *
5-A34632   3 20/15 IS NEXT STEP OF SUBTRACTION PROBLEM
5-A22010   (3)(-3) + 4 = -5
5-A31732   LEAST COMMON DENOMINATOR OF 7/15 & 4/9 IS 45
5-B31225   2 TIMES SQRT 5 = SQRT 20 *
5-A45232   300.00/36 IS FIRST STEP FOR 3 DIVIDED BY .36
5-A23009   EXPRESS 9/100 AS 9%
5-C20021   1/2 + 1/3 = ?
5-C20006   2/3 OF 9 = 6
5-C30010   (.4) x (3.6) = 1.44 *
5-F00003   4**3 = ?
5-B13002   (+3) + (-3) = 0 *
5-C10049   420 DIVIDED BY 35 = 12
5-C20022   (1/2)(1/4) = ?
5-A11832   3/9 IS THE SAME AS 1/3
5-C10011   SUM OF FOUR NUMBERS
5-C10009   43 + 71 + 75 + 92 = 281 *

* ITEM TEXT SLIGHTLY REVISED IN ORDER TO MAINTAIN SECURITY.

FIGURE 5

ITEM THRESHOLDS AND RACE/ETHNICITY CONDITIONAL MARGINS:
ARITHMETIC COMPUTATION

squares less than twice their degrees of freedom are considered
indicative of acceptable fit; it may be seen, however, that
several of the likelihood ratio (LR) Chi-squares exceed this
value. Freeman-Tukey (FT) Chi-squares, on the other hand, range
between one and one-and-a-half times their degrees of freedom,
suggesting a highly satisfactory goodness-of-fit. Inasmuch as the
two indices are asymptotically equivalent but the Freeman-Tukey
Chi-square is less susceptible to problems with small cells, it
would appear that the observed proportions of correct response in
the examples are well-explained by the group-effects model and the
parameter estimates.

| It will be recalled that an item's threshold is the point 

along the ability scale at which we would expect 50-percent cor- 
I rect responses to the item. Group averages may be interpreted in 

terms of item content, then, by inspecting the content of the 
I Hems in the region of the scale at which the average falls. The 

■ group's proportion of correct responses would be about 50-percent 

for items in that neighborhood, less than 50-percent for items 
| with higher thresholds, and greater than 50-percent for items with 

lower thresholds. In this way the content of items with thresh- 
I olds at various points along the scale forms a picture of the 

ability scale upon which group effects are measured. 

Figures 3 through 5, depicting the example scales^, show 
I f reasonable patterns of increasing complex or advanced item content 

I at increasing levels of 9. Algebraic Manipulations and Arithmetic 

I 
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Computation show a broader and more evenly-spaced distribution of 
items along the scales than does Understanding Mathematical Con- 
cepts. Items from the latter scale are more concentrated in the 
area that includes average 13-year olds and 17-year olds, but more 
sparse in the lower regions of the scale. 

Under the assumptions of the model, item parameters in a scale are
invariant across ages and assessment years. If this is true,
progress may be charted in terms of ability estimates alone;
changes in the value of the global ability correctly reflect
changes in probabilities of correct response to each individual
item in the scale. Departures from this assumption, such as
varying change over years from one item to another, are revealed
as discrepancies among an item's parameter estimates in different
age/years, after optimal rescaling (Tables 10, 13, and 16).

An examination of these tables shows few age/year item parameters
further than one-and-a-half standard errors of estimation from the
corresponding grand averages; in other words, the assumption of
invariant item parameters across the ages and years in the
examples is reasonably well satisfied. The interpretation of cases
in which certain items were unexpectedly hard or easy in a
particular age/year is left to curricular experts, although one
pattern is suggested in the results for 1977 13-year-olds in
Algebraic Manipulations: both items found unexpectedly difficult
in this age/year, compared to results on the other items in the
scale, deal with solving fractional equations. In the main,
however, the assumption of invariant scales across age/years and
the subsequent discussion of trends in terms of ability estimates
rather than for individual items are justifiable.

The universal test score decline of the seventies spans the period
covered by our examples and, with minor exceptions, appears in all
skill areas, age levels, and demographic subgroups addressed here.
Only in the area of Arithmetic Computation and only for
13-year-olds did levels of performance increase. In Concepts and
Manipulation, equal decline was observed at both ages.

Male versus female contrasts in all three skill areas exhibit an
interesting age-by-sex interaction: 13-year-old females outperform
13-year-old males, but 17-year-old males outperform 17-year-old
females. (An exception is 1972 Concepts, where 13-year-old males
outperform females.) One possible explanation of this result is
that the well-established superiority of males in certain areas of
mathematics (Anastasi, 1958) is manifest in the more abstract
tasks in the higher regions of the scales but overwhelmed by
superior study habits of females in the elementary grades on the
less abstract tasks in the lower regions of the scales.

Race/ethnicity contrasts uniformly exhibit highest levels of
performance by whites, followed at a distance by Hispanics, then
blacks. The magnitude of the difference is such that the averages
of 13-year-old whites equal or exceed those of 17-year-old blacks.
A comparison of 1972 and 1977 results shows blacks at both age
levels catching up somewhat in Arithmetic Computation but both
black and Hispanic 13-year-olds falling further behind in
Understanding Mathematical Concepts. In the remaining ages and
skill areas, relative positions among the race/ethnicity groups
remained about the same.

Contrasts among different regions of the country are of a much
smaller magnitude. Performance is highest in the Central region,
generally followed by the Northeast, West, and Southeast. The
period covered by the examples saw a shift of population from the
Northeast and Central regions to the Southeast and West; possible
correlates of this shift are visible in region contrasts and
margins. In Concepts, the distance between the Northeast and
Central averages and the Southeast and West averages increased at
both age levels from 1972 to 1977. Similar gaps in Manipulation
decreased for 13-year-olds but increased for 17-year-olds; gaps in
Computation also decreased for 13-year-olds but remained unchanged
for 17-year-olds.

The results for size and type of community (STOC) show the effects
of a high concentration of well-educated and highly paid
professionals on the level of achievement in a neighborhood. The
low metropolitan areas have a low level of income and few
professionals reside in them; hence, the level of achievement is
low. Levels of income and proportions of professionals rise as one
goes from low metropolitan areas to rural areas, small places,
main big cities, and to urban fringe areas. Finally, in urban
areas where the levels of income and education are highest, young
people's levels of performance are highest also.

Declines in performance were generally more pronounced in the STOC
categories that were lowest to begin with (i.e., low metropolitan
and rural areas) but less pronounced in the higher STOC
categories. In fact, the high metropolitan category showed
increases as often as declines, particularly among 13-year-olds.
Aside from the concern of general decline, then, there is evidence
of increasing disparity in the relative positions of communities
as time progresses.



CHAPTER IV 
CONCLUSIONS 

The Reiser group-effects model was successfully used to link data
across two age levels and over two time points in each of three
skill areas of the National Assessment of Educational Progress
surveys of mathematics. Experience gained in this effort led to
several important conclusions concerning the application of item
response methods in general and of the group-effects model in
particular to the National Assessment.

Items grouped at the level of NAEP subtopics proved satisfactory
for scaling with a unidimensional model, even across age levels
and assessment years. Goodness-of-fit indices within the age/year
data matrices and successful links across ages and years imply
that trends and group differences can be profitably analyzed at
this higher level of abstraction than the individual item, yet
allowing for the administration of different subsets of items to
different age groups and at different points in time. This finding
is particularly fortuitous when seen in the light of the NAEP
multiple-matrix sampling design; the items from a subtopic are
generally spread over several test booklets. Such a scheme yields
more precise estimates of group-level attainment than a scheme
that presented more items from a scale to fewer different persons.



More inclusive and broader-ranged collections of items would not
have led to satisfactory results. The combined calibration of
Arithmetic Computation and Understanding Mathematical Concepts,
for example, could not have shown how blacks were closing the gap
from whites in the former area but lagging farther behind in the
latter. The need for scales that maintain their integrity over
time, then, requires rather narrow domains for scaling. While it
may be convenient with current NAEP data tapes to scale together
all the items that happen to appear in the same booklet, the
intentional heterogeneity of such a collection virtually
guarantees a poor fit to any unidimensional item response model
and severe item parameter drift over time. Under current NAEP
item-sampling designs, the practice of item response scaling
within NAEP booklets should be most strongly discouraged.

Given that scaling must be accomplished within fairly narrow skill
areas (e.g., NAEP subtopics), methods of summarizing results over
these areas must be determined. If levels of performance increase
in computational skills but decrease in understanding concepts, as
an example, what should be said about skill in mathematics as a
whole? Clearly some scheme of indexing or weighted averaging is
required, with explicit rules by which the information from the
separate skills is combined.

Within these restrictions, alternative methods of scaling are
available. This project has made more clear some of the advantages
and disadvantages of one of those alternatives, namely, the Reiser
model for group effects.

Of great advantage to this project was the fact that numbers of
attempts and correct responses to each item in a scale from each
cell in a design on persons are sufficient for estimating item
parameters and group effects. A summary file at the level of
groups of persons rather than a full file at the level of
individuals need be handled. This same feature of the model,
however, may be seen as a disadvantage as well. Because the model
addresses data at the level of cells in the design on persons,
there are practical limits to the complexity of the design that
may be employed before the numbers of persons in the cells become
too small. The design used in these examples contained sex,
race/ethnicity, region of the country, and size and type of
community: 168 cells in all. Several of these cells were small or
empty, and it is clear that not many additional factors could be
included in the design before there were more cells than
observations.

In sum, these applications of the group-effects model can be 
considered successful as a demonstration of the practicality of 
applying item-response methods to the efficient multiple-matrix 
data of modern assessments. Whether the group-effects model or a 
close cousin eventually dominates, the generic advantages of item 
response theory are sure to advance the practice of assessment. 

APPENDIX A
A LATENT TRAIT MODEL FOR GROUP EFFECTS

A Detailed Specification of the Model

The development of the first two sections of this chapter
parallels that of Bock (1976). In the first part the model is
stated in terms of a binomial response function. In the second
part, maximum likelihood estimates are derived for the parameters
in the model. The last section consists of a discussion of the
asymptotic properties of the estimates. A test of fit for the
model is also discussed in this section.

Two symbols which are used repeatedly in this chapter require a
brief explanation. S is used as a summation sign, instead of the
more common upper case sigma, and d is used as the symbol
indicating a derivative.

Assume that subjects respond to one item from the set of $n$ items
which constitute the scale, and that subjects are assigned to $f$
homogeneous sample groups. Assume also that subjects in group $q$
are a probability sample from a conditionally normal latent trait
distribution with mean represented by the contrast $k_q'\Theta$
and variance $\sigma^2$.

$k_q'$ represents the $q$th row of the general design matrix $K$,

$$ K = \begin{bmatrix} k_{11} & k_{12} & k_{13} & \cdots & k_{1s} \\ k_{21} & k_{22} & k_{23} & \cdots & k_{2s} \\ k_{31} & k_{32} & k_{33} & \cdots & k_{3s} \\ \vdots & & & & \vdots \\ k_{f1} & k_{f2} & k_{f3} & \cdots & k_{fs} \end{bmatrix} $$

$s$ is the rank of the model for estimation.

$\Theta$ represents a vector of contrasts among the group effects.
A subject's response to item $j$ is scored

$$ h_{qj} = \begin{cases} 1 & \text{if correct} \\ 0 & \text{otherwise} \end{cases} $$

The probability that the subject (or respondent) responds
correctly is given by the logistic ogive (logit) model. The
logistic curve is used here as an approximation to the much more
complicated normal cumulative distribution function. Haberman
(1974, p. 34) concludes that no empirical evidence exists that the
normal distribution provides more accurate models than the
logistic. So,

$$ P(h_{qj} = 1) = F(z_{qj}) = 1/(1 + \exp(-z_{qj})), \quad \text{and} \quad P(h_{qj} = 0) = 1 - F(z_{qj}). $$

As mentioned in chapter 1, the design of the sample groups is
introduced into the specification of the logit, $z_{qj}$:

$$ z_{qj} = c_j + a_j\, k_q'\Theta, $$

where $c_j$ and $a_j$ are parameters for item $j$.

The principle of local independence states that responses to items
are independent, conditional on item and group parameters. By this
principle, the probability of $r_{qj}$ correct responses from the
$N_{qj}$ respondents in group $q$ who attempt item $j$ is given by
the binomial function:

$$ P(r_{qj} \mid N_{qj}, k_q'\Theta, c_j, a_j) = \frac{N_{qj}!}{r_{qj}!\,(N_{qj} - r_{qj})!}\, F(z_{qj})^{r_{qj}}\, [1 - F(z_{qj})]^{N_{qj} - r_{qj}} $$

The probability of the entire sample is taken over groups and
items:

$$ \mathrm{Prob} = \prod_{q}^{f} \prod_{j}^{n} P(r_{qj} \mid N_{qj}, k_q'\Theta, c_j, a_j) $$
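As an illustration, the sample probability above can be evaluated numerically on the log scale. The following Python sketch is not part of the original report: the function names, the toy design matrix, and the counts are hypothetical, and the binomial coefficients are dropped because they do not depend on the parameters.

```python
import numpy as np

def logistic(z):
    """F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(c, a, theta, K, N, r):
    """Log of the sample probability, omitting the binomial coefficients."""
    eta = K @ theta                       # k_q' Theta, one value per group q
    z = c[None, :] + np.outer(eta, a)     # z[q, j] = c_j + a_j * k_q' Theta
    F = logistic(z)
    return float(np.sum(r * np.log(F) + (N - r) * np.log1p(-F)))

# Hypothetical toy problem: f = 3 groups, s = 2 contrasts, n = 2 items.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([-0.5, 0.5])
c = np.array([0.2, -0.3])
a = np.array([1.0, 1.5])
N = np.full((3, 2), 50.0)                                  # attempts
r = np.array([[20.0, 12.0], [35.0, 30.0], [28.0, 20.0]])   # correct responses

ll = log_likelihood(c, a, theta, K, N, r)
print(ll)
```

Only the cell-level counts `N` and `r` enter the computation, which is the sense in which a summary file at the level of groups is sufficient.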



For a $t$-way design on the subjects, the factor indices $i_w$
vary from 1 to $m_w$ for $w = 1, \ldots, t$. The number of cells
in the design, $f$, is equal to $\prod_{w} m_w$. Any cell in the
design can be referenced by a single subscript $q$ as follows:

$$ q = i_1 + \mathrm{S}_{w=2}^{t}\, (i_w - 1) \prod_{r<w} m_r $$


Equation (1) specifies a quantal-response type model which is
closely related to the probit model of Finney (1971) and the logit
model of Berkson (1944). Finney uses the cumulative normal
distribution, but as stated before, there is no empirical evidence
for preferring the normal over the logistic, and the logistic is
considerably simpler. The logistic quantal response models are
log-linear models, and thus many of the methods and results from
Haberman (1974) are applicable to the present model.

Derivation of Parameter Estimates

Estimates for the item and group parameters can be obtained in a
straightforward manner by the method of maximum likelihood. As
will be seen in this section, there are two linear dependencies
among the set of vectors which consists of the columns of the
information matrix. In order to eliminate these dependencies, two
parameters must be fixed arbitrarily. One choice which can be made
here would be to fix the first and the $m_1$th effects from the
first factor of the design at -1 and +1 respectively. This sets
the scale of all the estimates in a very convenient range. Another
choice for eliminating one of the dependencies would be to include
a prior distribution for the item slope parameter within the
model. Some previous experience with two-parameter models has
shown that including this prior knowledge results in a more
well-behaved solution in the sense that parameter estimates for
items on which there is little information in the data will not
take on a value which is unduly large. What happens in practice is
that the slope parameter can become very high for an item on which
the responses are either nearly all correct or nearly all
incorrect. For such an item, the information provided by the prior
distribution becomes dominant, and the solution is primarily a
function of this information.

Since such items add essentially nothing to the likelihood of the
data, eliminating them from the analysis entirely constitutes an
equally effective strategy. However, the prior distribution
alternative renders the model more robust in the sense that less
work with the data will be necessary before satisfactory estimates
are obtained. Consequently, at some points in the derivation,
information will be included describing changes that would be
required in the equations in order to obtain maximum a posteriori
density (MAP) estimates. A complete derivation of the MAP
estimates would be nearly the same as the derivation of the
maximum likelihood estimates, and so it would be needlessly
repetitive.

For the maximum a posteriori density estimates, the slope
parameter, $a_j$, is assumed to be distributed log-normally with
mean $\mu_a$ and variance $\sigma_a^2$. These two parameters for
the distribution are given values by the researcher before
estimates of the item parameters are obtained. A state of nearly
total ignorance about the prior distribution can be indicated by
specifying a large variance. The results of Lindley and Smith
(1972) show that it is more reasonable to estimate the mode rather
than the mean of the posterior distribution, so the easier path
will be taken here.

For maximum likelihood estimates, the log likelihood of the entire
sample is obtained directly from expression (2):

$$ \ell(c, a, \Theta) = \mathrm{S}_q\, \mathrm{S}_j \left[ \text{const} + r_{qj} \log F(z_{qj}) + (N_{qj} - r_{qj}) \log(1 - F(z_{qj})) \right] $$

Bock and Thissen (1979) show the general form for the logarithm of
the posterior density. In this setting, it takes the following
form:

$$ \ell(c, a, \Theta) = \mathrm{S}_q\, \mathrm{S}_j \left[ \text{const} + r_{qj} \log F(z_{qj}) + (N_{qj} - r_{qj}) \log(1 - F(z_{qj})) \right] - \frac{1}{2}\, \mathrm{S}_j\, \frac{(\log a_j - \mu_a)^2}{\sigma_a^2} $$

Notice that the difference between these two equations consists
only of a term, after the minus sign on the right, which
represents the prior information.

The following are obtained now for use later:

$$ F(z_{qj}) = \exp(z_{qj})/(\exp(z_{qj}) + 1) = 1/(1 + \exp(-z_{qj})) $$

$$ 1 - F(z_{qj}) = \frac{\exp(z_{qj}) + 1}{\exp(z_{qj}) + 1} - \frac{\exp(z_{qj})}{\exp(z_{qj}) + 1} = \frac{1}{1 + \exp(z_{qj})} = \frac{\exp(-z_{qj})}{\exp(-z_{qj}) + 1} $$

$$ \frac{dF(z_{qj})}{dz_{qj}} = \frac{d\,(1 + \exp(-z_{qj}))^{-1}}{dz_{qj}} = \frac{\exp(-z_{qj})}{(1 + \exp(-z_{qj}))(1 + \exp(-z_{qj}))} = F(z_{qj})[1 - F(z_{qj})] $$

$$ \frac{d[1 - F(z_{qj})]}{dz_{qj}} = -F(z_{qj})[1 - F(z_{qj})] $$

$$ \frac{dz_{qj}}{dc_j} = 1, \qquad \frac{dz_{qj}}{da_j} = k_q'\Theta, \qquad \frac{dz_{qj}}{d\Theta_g} = k_{qg}\, a_j $$

Once the likelihood function has been chosen, the maximum of the
function with respect to a given parameter is often the point at
which the rate of change of the function, the first derivative, is
equal to zero. Such a point could also be a minimum or a boundary
point, so other aspects of the likelihood function have to be
investigated. We will attend to these other aspects shortly.



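Maxima:

$$ \frac{d\ell}{dc_j} = \mathrm{S}_q \left[ \frac{r_{qj}}{F(z_{qj})} - \frac{N_{qj} - r_{qj}}{1 - F(z_{qj})} \right] F(z_{qj})[1 - F(z_{qj})]\, \frac{dz_{qj}}{dc_j} = \mathrm{S}_q\, [r_{qj} - N_{qj} F(z_{qj})] = 0 $$

$$ \frac{d\ell}{da_j} = \mathrm{S}_q\, [r_{qj} - N_{qj} F(z_{qj})]\, k_q'\Theta = 0 $$

$$ \frac{d\ell}{d\Theta_g} = \mathrm{S}_q\, \mathrm{S}_j\, [r_{qj} - N_{qj} F(z_{qj})]\, k_{qg}\, a_j = 0 $$

For MAP estimates, the preceding equations would differ only in
the presence of the so-called penalty function as a second term in
the derivative with respect to $a_j$:

$$ \frac{d\ell}{da_j} = \mathrm{S}_q\, [r_{qj} - N_{qj} F(z_{qj})]\, k_q'\Theta - \frac{\log a_j - \mu_a}{\sigma_a^2\, a_j} = 0 $$

If $B_{qj}$ is set equal to $r_{qj} - N_{qj} F(z_{qj})$, the
preceding likelihood equations can be rewritten in simpler
expressions:

$$ c_j: \quad \mathrm{S}_q\, B_{qj} = 0 $$

$$ a_j: \quad \mathrm{S}_q\, B_{qj}\, k_q'\Theta = 0 $$

$$ \Theta_g: \quad \mathrm{S}_q\, \mathrm{S}_j\, B_{qj}\, k_{qg}\, a_j = 0 $$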
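The likelihood equations in terms of the residuals B_qj can be checked numerically. The sketch below is illustrative only: the function names and toy data are hypothetical, and the central-difference check is an added verification device, not something described in the report.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(c, a, theta, K, N, r):
    F = logistic(c[None, :] + np.outer(K @ theta, a))
    return float(np.sum(r * np.log(F) + (N - r) * np.log1p(-F)))

def gradient(c, a, theta, K, N, r):
    """First derivatives with respect to c_j, a_j and Theta_g."""
    eta = K @ theta                        # k_q' Theta
    F = logistic(c[None, :] + np.outer(eta, a))
    B = r - N * F                          # B[q, j] = r_qj - N_qj F(z_qj)
    d_c = B.sum(axis=0)                    # S_q B_qj
    d_a = eta @ B                          # S_q B_qj k_q' Theta
    d_theta = K.T @ (B @ a)                # S_q S_j B_qj k_qg a_j
    return d_c, d_a, d_theta

def central_difference(f, x, i, eps=1e-6):
    step = np.zeros_like(x)
    step[i] = eps
    return (f(x + step) - f(x - step)) / (2.0 * eps)

# Hypothetical toy data: 3 groups, 2 contrasts, 2 items.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([-0.5, 0.5])
c = np.array([0.2, -0.3])
a = np.array([1.0, 1.5])
N = np.full((3, 2), 50.0)
r = np.array([[20.0, 12.0], [35.0, 30.0], [28.0, 20.0]])

d_c, d_a, d_theta = gradient(c, a, theta, K, N, r)
fd_c0 = central_difference(lambda x: loglik(x, a, theta, K, N, r), c, 0)
fd_a1 = central_difference(lambda x: loglik(c, x, theta, K, N, r), a, 1)
fd_t0 = central_difference(lambda x: loglik(c, a, x, K, N, r), theta, 0)
```

Agreement between the analytic and finite-difference values is a quick way to catch sign or index errors before the derivatives are used in an iterative solution.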



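These equations cannot be solved explicitly for the unknown
parameters, but estimates can be obtained by an iterative
numerical procedure such as that of Newton-Raphson. Second
derivatives of the log likelihood are required for the purpose of
investigating the shape of the likelihood function and for use in
the Newton-Raphson procedure.

Second derivatives:

$$ \frac{d^2\ell}{dc_j\, dc_h} = -\delta_{jh}\, \mathrm{S}_q\, N_{qj} F(z_{qj})[1 - F(z_{qj})] $$

$$ \frac{d^2\ell}{dc_j\, da_h} = -\delta_{jh}\, \mathrm{S}_q\, N_{qj} F(z_{qj})[1 - F(z_{qj})]\, k_q'\Theta $$

$$ \frac{d^2\ell}{da_j\, da_h} = -\delta_{jh}\, \mathrm{S}_q\, N_{qj} F(z_{qj})[1 - F(z_{qj})]\, (k_q'\Theta)^2 $$

$$ \frac{d^2\ell}{dc_j\, d\Theta_g} = -\mathrm{S}_q\, N_{qj} F(z_{qj})[1 - F(z_{qj})]\, k_{qg}\, a_j $$

$$ \frac{d^2\ell}{da_j\, d\Theta_g} = \mathrm{S}_q \left\{ -N_{qj} F(z_{qj})[1 - F(z_{qj})]\, (k_q'\Theta)\, k_{qg}\, a_j + B_{qj}\, k_{qg} \right\} $$

$$ \frac{d^2\ell}{d\Theta_g\, d\Theta_h} = -\mathrm{S}_q\, \mathrm{S}_j\, N_{qj} F(z_{qj})[1 - F(z_{qj})]\, k_{qg}\, k_{qh}\, a_j^2 $$

where

$$ \delta_{jh} = \begin{cases} 1 & \text{if } j = h \\ 0 & \text{otherwise} \end{cases} $$

is known as Kronecker's delta.

Only one of the second derivatives would differ if we were
deriving MAP estimates. The derivative taken twice with respect to
the slope would include another term on the right hand side of the
equation:

$$ \frac{d^2\ell}{da_j\, da_h} = \delta_{jh} \left[ -\mathrm{S}_q\, N_{qj} F(z_{qj})[1 - F(z_{qj})]\, (k_q'\Theta)^2 - \frac{1 - \log a_j + \mu_a}{\sigma_a^2\, a_j^2} \right] $$

By using the expected values for $r_{qj}$ and $B_{qj}$ in the
above equations the elements of the information matrix can be
obtained.

$$ E(B_{qj}) = N_{qj} F(z_{qj}) - N_{qj} F(z_{qj}) = 0 $$

Also, set $W_{qj} = F(z_{qj})[1 - F(z_{qj})]$.

The elements of the information matrix are then as follows:

$$ A_{11}: \quad E\left(-\frac{d^2\ell}{dc_j\, dc_h}\right) = \delta_{jh}\, \mathrm{S}_q\, N_{qj} W_{qj} $$

$$ A_{12}: \quad E\left(-\frac{d^2\ell}{dc_j\, da_h}\right) = \delta_{jh}\, \mathrm{S}_q\, N_{qj} W_{qj}\, k_q'\Theta $$

$$ A_{22}: \quad E\left(-\frac{d^2\ell}{da_j\, da_h}\right) = \delta_{jh}\, \mathrm{S}_q\, N_{qj} W_{qj}\, (k_q'\Theta)^2 $$

$$ B_1: \quad E\left(-\frac{d^2\ell}{dc_j\, d\Theta_g}\right) = \mathrm{S}_q\, N_{qj} W_{qj}\, k_{qg}\, a_j $$

$$ B_2: \quad E\left(-\frac{d^2\ell}{da_j\, d\Theta_g}\right) = \mathrm{S}_q\, N_{qj} W_{qj}\, (k_q'\Theta)\, k_{qg}\, a_j $$

$$ C: \quad E\left(-\frac{d^2\ell}{d\Theta_g\, d\Theta_h}\right) = \mathrm{S}_q\, \mathrm{S}_j\, N_{qj} W_{qj}\, k_{qg}\, a_j\, k_{qh}\, a_j $$

The information matrix, $I(c, a, \Theta)$, takes the form
$I(c, a, \Theta) = X'WX$ where

$$ W = \mathrm{diag}[N_{11}W_{11}, N_{21}W_{21}, \ldots, N_{f1}W_{f1}, N_{12}W_{12}, \ldots, N_{fn}W_{fn}] $$

is positive definite.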






of linear combinations of columns with linearly independent
coefficients $0, 0, 0, \ldots, 0, a_1, a_2, a_3, \ldots, a_n,
-\Theta_1, -\Theta_2, -\Theta_3, \ldots, -\Theta_s$. It is
anticipated that identity contrasts will always be used over the
first factor during parameter estimation, hence another dependency
exists among the columns of $X$ as a result of these identity
contrasts. The linear combination of columns with linearly
independent coefficients $a_1, a_2, a_3, \ldots, a_n, 0, 0, 0,
\ldots, 0, -1, -1, -1, \ldots, -1, 0, 0, 0, \ldots, 0$ shows the
dependence. If any two parameters are arbitrarily fixed and the
corresponding likelihood equations deleted, the information matrix
with the corresponding rows and columns deleted is positive
definite. A necessary and sufficient condition for the
log-likelihood function to be concave and have a unique maximum is
that this Hessian matrix (matrix of negative of expected value of
second derivatives) is positive definite. In the limit, therefore,
unique maximum likelihood estimates of the parameters exist.
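The two dependencies can be seen numerically by building X and W for a small design. The following Python sketch is an illustration under stated assumptions, not the report's program: the two-factor design, parameter values, and variable names are hypothetical, with identity (indicator) contrasts over the first factor and a single +1/-1 contrast for the second.

```python
import numpy as np

n, m1 = 3, 3                                   # items; levels of the first factor
lev1 = np.tile(np.arange(m1), 2)               # first-factor level of each cell
lev2 = np.repeat([-1.0, 1.0], m1)              # second factor as one +/-1 contrast
K = np.column_stack([np.eye(m1)[lev1], lev2])  # identity contrasts over factor 1
f, s = K.shape                                 # f = 6 cells, s = 4 effects

theta = np.array([-1.0, 0.3, 1.0, 0.4])
c = np.array([-0.5, 0.0, 0.8])
a = np.array([0.8, 1.0, 1.3])
eta = K @ theta                                # k_q' Theta
F = 1.0 / (1.0 + np.exp(-(c[None, :] + np.outer(eta, a))))
N = np.full((f, n), 100.0)

# One row of X per (group q, item j): derivatives of z_qj with respect to
# (c_1..c_n, a_1..a_n, Theta_1..Theta_s); q varies fastest within item j.
rows = []
for j in range(n):
    for q in range(f):
        x_c = np.zeros(n); x_c[j] = 1.0
        x_a = np.zeros(n); x_a[j] = eta[q]
        rows.append(np.concatenate([x_c, x_a, a[j] * K[q]]))
X = np.array(rows)
W = np.diag((N * F * (1.0 - F)).T.ravel())     # N_11 W_11, N_21 W_21, ..., N_fn W_fn
info = X.T @ W @ X
rank = np.linalg.matrix_rank(info, tol=1e-8)

# The two coefficient vectors named in the text.
dep1 = np.concatenate([np.zeros(n), a, -theta])
dep2 = np.concatenate([a, np.zeros(n), -np.array([1.0, 1.0, 1.0, 0.0])])

# Fix the first and the m1-th effects of the first factor, delete their rows/columns.
keep = [i for i in range(2 * n + s) if i not in (2 * n, 2 * n + m1 - 1)]
reduced = info[np.ix_(keep, keep)]
print(rank, 2 * n + s - 2)
```

In this toy case the full matrix has rank 2n + s - 2, and the matrix left after deleting the two fixed effects has only positive eigenvalues.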
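The information matrix can be written in partitioned form:

$$ I(c, a, \Theta) = \begin{bmatrix} A & B' \\ B & C \end{bmatrix} $$

where

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} $$

$$ A_{11} = \mathrm{diag}(\mathrm{S}_q\, N_{qj} W_{qj}) $$

$$ A_{12} = A_{21} = \mathrm{diag}(\mathrm{S}_q\, N_{qj} W_{qj}\, k_q'\Theta) $$

$$ A_{22} = \mathrm{diag}(\mathrm{S}_q\, N_{qj} W_{qj}\, (k_q'\Theta)^2) $$

$$ B = [\, B_1 \;\; B_2 \,] = [\, \mathrm{S}_q\, N_{qj} W_{qj}\, k_{qg}\, a_j \;\;\; \mathrm{S}_q\, N_{qj} W_{qj}\, (k_q'\Theta)\, k_{qg}\, a_j \,] $$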



Since $X$ has a deficiency in rank of 2, two parameters can be
arbitrarily fixed and the corresponding likelihood equations
deleted. As discussed earlier in this chapter, the first and
$m_1$th effects from the first factor of the design are the
parameters chosen to be fixed at -1 and +1 respectively. This
choice conveniently sets the scale of the solution in terms of the
range of the first factor effects. As also discussed previously,
one of the linear dependencies can be eliminated by specifying a
prior distribution on the slope parameters instead of arbitrarily
fixing a second parameter. The dependency eliminated by the prior
would be the one associated with the linearly independent
coefficients $0, 0, 0, \ldots, 0, a_1, a_2, a_3, \ldots, a_n,
-\Theta_1, -\Theta_2, -\Theta_3, \ldots, -\Theta_s$.

The inclusion of the prior distribution does not change the
composition of the $X$ matrix, but the $X$ matrix is never
actually formed during the estimation procedure. The information
matrix is formed in the three partitions: $A$, $B$, and $C$. The
last $n$ columns of the submatrix $A$ are formed by the inner
products of the $(n+1)$th through $2n$th columns of the $X$
matrix. These columns are linearly dependent on the last $s$
columns of $X$. Now the prior distribution is included by adding
the matrix $G$, say, where

$$ G = \mathrm{diag}\left(0, 0, 0, \ldots, 0,\; \frac{1 - \log a_1 + \mu_a}{\sigma_a^2\, a_1^2},\; \frac{1 - \log a_2 + \mu_a}{\sigma_a^2\, a_2^2},\; \ldots,\; \frac{1 - \log a_n + \mu_a}{\sigma_a^2\, a_n^2},\; 0, 0, 0, \ldots, 0\right) $$

to the information matrix, which has the effect of adding a term
to each of the last $n$ diagonal elements of $A$, $A$ being $2n$
by $2n$. The linear dependency among the last columns of $A$ and
the other rows (columns) of the information matrix is thus
eliminated by the addition of the elements of $G$ to the diagonal,
and the additional row and column need not be deleted in this
case.

For the model with no prior distribution on the slope parameters,
two rows and columns corresponding to two group effects are
deleted, and the information matrix will be positive definite.
Then,

$$ I(c, a, \Theta^*) = \begin{bmatrix} A & B^{*\prime} \\ B^* & C^* \end{bmatrix} $$

is the $2n + s - 2$ rank information matrix. If a prior
distribution is specified for the slopes, rows and columns
corresponding to only one effect are deleted. $I(c, a, \Theta^*)$
will still be positive definite, but the rank will be
$2n + s - 1$.

For the MAP estimation, the information matrix is adjusted when
used in the Newton-Raphson iterations for the influence of the
prior distribution, resulting in the matrix, say, $E$.

$$ E = I(c, a, \Theta^*) + G $$

where $G$ takes the form as defined previously.

For the regular maximum likelihood estimates, no adjustment is
made to the information matrix.

Proceeding to obtain the necessary quantities for the scoring
solution, we need the inverse of the information matrix, or the
information matrix as adjusted for the prior distribution.

$$ I^{-1} = \begin{bmatrix} A^{-1} + A^{-1}B^{*\prime}(C^* - B^*A^{-1}B^{*\prime})^{-1}B^*A^{-1} & -A^{-1}B^{*\prime}(C^* - B^*A^{-1}B^{*\prime})^{-1} \\ -(C^* - B^*A^{-1}B^{*\prime})^{-1}B^*A^{-1} & (C^* - B^*A^{-1}B^{*\prime})^{-1} \end{bmatrix} $$



There are some aspects of $I^{-1}$ which can be used for efficient
computing. The whole matrix is of course Grammian, so the upper
right partition is simply the transpose of the lower left
partition. The right hand term in the upper left partition,
$A^{-1}B^{*\prime}(C^* - B^*A^{-1}B^{*\prime})^{-1}B^*A^{-1}$, is
Grammian, and can be formed with specialized routines from the two
matrices $B^*A^{-1}$ and $(C^* - B^*A^{-1}B^{*\prime})^{-1}$,
which is the lower right partition. The matrix $A^{-1}$ does not
require heavy computation because the matrix $A$ consists of
partitions which are diagonal:

$$ A^{-1} = \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix} $$

where

$$ A^{11} = D^{-1}A_{22}, \qquad A^{21} = A^{12} = -D^{-1}A_{12}, \qquad A^{22} = D^{-1}A_{11}, $$

with $D = A_{11}A_{22} - A_{12}^2$.

The largest matrix to be directly inverted is the $s - 2$ rank
$(C^* - B^*A^{-1}B^{*\prime})$.
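The computational shortcut can be sketched as follows. This Python fragment is illustrative rather than the report's routine: the matrices are random stand-ins with the same block structure (diagonal partitions in A), all names are hypothetical, and the definition D = A11*A22 - A12^2 is the elementwise determinant of each 2 by 2 block, assumed here from the diagonal structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s_free = 4, 3                                # items; free group effects

# Stand-in blocks: A has diagonal partitions, C is symmetric.
A11 = rng.uniform(5.0, 10.0, n)
A22 = rng.uniform(5.0, 10.0, n)
A12 = rng.uniform(-1.0, 1.0, n)
Bm = 0.5 * rng.normal(size=(s_free, 2 * n))
M = rng.normal(size=(s_free, s_free))
Cm = M @ M.T + 10.0 * np.eye(s_free)

# Inverse of A from elementwise operations on its diagonal partitions.
D = A11 * A22 - A12 ** 2
A_inv = np.block([[np.diag(A22 / D), np.diag(-A12 / D)],
                  [np.diag(-A12 / D), np.diag(A11 / D)]])

# Only the small s_free x s_free matrix is inverted directly.
BA_inv = Bm @ A_inv
S_inv = np.linalg.inv(Cm - BA_inv @ Bm.T)       # lower right partition
upper_left = A_inv + BA_inv.T @ S_inv @ BA_inv
lower_left = -S_inv @ BA_inv
inv_blocks = np.block([[upper_left, lower_left.T],
                       [lower_left, S_inv]])

# Reference: the full matrix assembled explicitly.
A = np.block([[np.diag(A11), np.diag(A12)],
              [np.diag(A12), np.diag(A22)]])
full = np.block([[A, Bm.T], [Bm, Cm]])
```

The partitioned result agrees with a direct inverse of the assembled matrix, while the only dense inversion has the dimension of the free group effects.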

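For the MAP estimates, $I^{-1}$ is replaced by $E^{-1}$. $E^{-1}$
is the same as $I^{-1}$ except for the contents of $A^{11}$ and
$D$. So, if the prior distribution is specified on the slopes,
$A^{11}$ and $D$ become as follows:

$$ A^{11} = D^{-1}\left[ A_{22} + \mathrm{diag}\left( \frac{1 - \log a_1 + \mu_a}{\sigma_a^2\, a_1^2},\; \frac{1 - \log a_2 + \mu_a}{\sigma_a^2\, a_2^2},\; \ldots,\; \frac{1 - \log a_n + \mu_a}{\sigma_a^2\, a_n^2} \right) \right] $$

$$ D = \mathrm{diag}\left[ (\mathrm{S}_q\, N_{qj} W_{qj}) \left( \mathrm{S}_q\, N_{qj} W_{qj}\, (k_q'\Theta)^2 + \frac{1 - \log a_j + \mu_a}{\sigma_a^2\, a_j^2} \right) - (\mathrm{S}_q\, N_{qj} W_{qj}\, (k_q'\Theta))^2 \right] $$

Here, $(C^* - B^*A^{-1}B^{*\prime})$ is of rank $s - 1$.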



The Newton-Raphson procedure consists of finding estimates at the
$(t+1)$th iteration by adding a correction to the estimates at the
$t$th iteration. The correction is obtained from multiplying the
inverse of the matrix of second derivatives by the matrix of first
derivatives. If the information matrix, which contains expected
values for the second derivatives, is substituted for the matrix
of actual second derivatives, the iterative procedure with this
substitution is known as Fisher's method of Efficient Scoring:

$$ \begin{bmatrix} c \\ a \\ \Theta^* \end{bmatrix}_{t+1} = \begin{bmatrix} c \\ a \\ \Theta^* \end{bmatrix}_{t} + I^{-1}(c, a, \Theta^*)_t \begin{bmatrix} \mathrm{S}_q\, B_{qj} \\ \mathrm{S}_q\, B_{qj}\, (k_q'\Theta) \\ \mathrm{S}_q\, \mathrm{S}_j\, B_{qj}\, k_{qg}\, a_j \end{bmatrix}_t $$

In extremum theory, the vector of first derivatives is known as
the gradient. The maximum of the likelihood function exists at the
zero of the gradient. The second derivatives tell how fast the
gradient is changing. As the gradient approaches its zero, it will
change faster and faster, and the elements of the inverse of the
information matrix will become smaller and smaller. So, at the
maximum of the likelihood function, the correction to be added to
the estimates becomes zero. The iterative process is stopped and
considered converged whenever the absolute value for all
corrections falls below a preassigned criterion. Starting values
of 0s and 1s, for the $c_j$'s and $a_j$'s respectively, have been
used with success. Least squares estimates of the group effects
calculated on the cell proportions can be used as starting values
for $\Theta$.
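The scoring iterations can be sketched compactly. The Python below is a minimal illustration under stated assumptions, not the program used in the report: the design, the "true" parameter values, and all function names are hypothetical; the data are exact expected counts (so the generating parameters are the maximum); the step-halving safeguard is an addition not described in the text; and the free group effects start at zero rather than at least squares estimates.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(c, a, theta, K, N, r):
    F = logistic(c[None, :] + np.outer(K @ theta, a))
    return float(np.sum(r * np.log(F) + (N - r) * np.log1p(-F)))

def scoring_correction(c, a, theta, K, N, r, free):
    """Inverse information times gradient, for c, a and the free effects."""
    eta = K @ theta                                    # k_q' Theta
    F = logistic(c[None, :] + np.outer(eta, a))
    B = r - N * F                                      # B_qj
    NW = N * F * (1.0 - F)                             # N_qj W_qj
    Kf = K[:, free]
    grad = np.concatenate([B.sum(0), eta @ B, Kf.T @ (B @ a)])
    A11, A12, A22 = NW.sum(0), eta @ NW, (eta ** 2) @ NW
    B1 = (Kf.T @ NW) * a[None, :]
    B2 = (Kf.T @ (NW * eta[:, None])) * a[None, :]
    C = Kf.T @ (Kf * (NW @ a ** 2)[:, None])
    info = np.block([[np.diag(A11), np.diag(A12), B1.T],
                     [np.diag(A12), np.diag(A22), B2.T],
                     [B1, B2, C]])
    return np.linalg.solve(info, grad)

def fit(K, N, r, theta_start, free, tol=1e-10, max_iter=100):
    n = N.shape[1]
    c, a, theta = np.zeros(n), np.ones(n), theta_start.copy()
    for _ in range(max_iter):
        delta = scoring_correction(c, a, theta, K, N, r, free)
        old, step = loglik(c, a, theta, K, N, r), 1.0
        for _ in range(30):                            # step halving (an added safeguard)
            c_new = c + step * delta[:n]
            a_new = a + step * delta[n:2 * n]
            th_new = theta.copy()
            th_new[free] += step * delta[2 * n:]
            if loglik(c_new, a_new, th_new, K, N, r) >= old:
                break
            step /= 2.0
        c, a, theta = c_new, a_new, th_new
        if np.max(np.abs(step * delta)) < tol:         # all corrections below criterion
            break
    return c, a, theta

# Hypothetical design: 3 x 2 cells, identity contrasts over the first factor.
lev1 = np.tile(np.arange(3), 2)
K = np.column_stack([np.eye(3)[lev1], np.repeat([-1.0, 1.0], 3)])
theta_true = np.array([-1.0, 0.3, 1.0, 0.4])           # first and third effects fixed
c_true = np.array([-0.5, 0.0, 0.8, 0.2])
a_true = np.array([0.8, 1.0, 1.3, 1.1])
N = np.full((6, 4), 200.0)
r = N * logistic(c_true[None, :] + np.outer(K @ theta_true, a_true))

free = [1, 3]                                          # effects left free to vary
c_hat, a_hat, theta_hat = fit(K, N, r, np.array([-1.0, 0.0, 1.0, 0.0]), free)
```

Fixing the first and third effects at -1 and +1 removes both dependencies, so the linear system solved at each iteration is nonsingular.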



Asymptotic Properties 

Many of the traditional results which hold for maximum likelihood
estimates are useful here. Since the only parameters associated
with the subjects are fixed group effects, this model avoids one
of the thorniest problems often encountered by two-parameter
latent trait models. In such a model where each subject has an
ability to be estimated, the subject parameter, which appears as a
nuisance parameter, cannot be conditioned out of the likelihood
equations, and the number of parameters increases with the number
of respondents. In the present model, however, the number of
parameters is fixed even as the number of respondents becomes very
large. Hence, standard results that are covered in general
treatments such as Cramer (1946) and Rao (1965) apply.

Maximum likelihood estimates have the properties of consistency
and asymptotic efficiency, the latter meaning that the variance of
the estimates is the minimum attainable by any consistent
estimator. Additionally, the estimates are distributed in
multivariate normal form, with variance-covariance matrix equal to
the inverse of the negative of the matrix of second derivatives,
i.e., the inverse of the information matrix. This information
measure, also known as Fisher's information, proves to be a
general index of sensitivity for small changes in the value of the
parameter (Rao, 1962).

The standard errors for the estimates are formed from elements of
the information matrix as follows:

$$ \mathrm{S.E.}(c_j) = 1/\mathrm{SQRT}(I_{j,j}) $$

$$ \mathrm{S.E.}(a_j) = 1/\mathrm{SQRT}(I_{n+j,\,n+j}) $$

$$ \mathrm{S.E.}(\Theta_g) = 1/\mathrm{SQRT}(I_{2n+g,\,2n+g}) $$

Fortunately, the terms needed for the denominators of the
expressions can be taken directly from the information matrix as
formed during the last iteration of the scoring procedure.

Testing Goodness of Fit 



Two statistics which are commonly used as a measure 

" , / 
of the distance .between the model and the data are the 

2 

likelinood ra tio-^qhi-sguare, sometimes written as G , 

2 

ana tne Pearson chi^sguare , t sometimes written as X . 
G 2 and X 2 are defined as follows: 



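$$G^2 = 2 \sum_{j=1}^{n} \sum_{g=1}^{f} \sum_{h=1}^{2} r_{gjh} \log \frac{r_{gjh}}{E_{gjh}},$$

where h is over all responses to an item, and

$$X^2 = \sum_{j=1}^{n} \sum_{g=1}^{f} \sum_{h=1}^{2} \frac{(r_{gjh} - E_{gjh})^2}{E_{gjh}},$$

with $r_{gjh}$ the observed and $E_{gjh}$ the expected number of
responses h to item j in cell g.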



The degrees of freedom for these statistics are equal
to nf - 2n - s + 2. In practice, the values of $G^2$
and $X^2$ are essentially the same for a given model and
data set, although $G^2$ may be more resistant to ill
effects of cells with very low expected value. It
can be shown quite readily that $X^2$ is a sum
of squares of approximate unit normal deviates, and
has, therefore, approximately a chi-square distribution
on nf - 2n - s + 2 degrees of freedom (see, for example,
Brownlee (1965)). Bishop, Fienberg, and Holland
(1975) show that $G^2$ and $X^2$ are asymptotically equivalent
under the correct model for the data. $G^2$ has the
overwhelming advantage that it can be used for comparing
alternative nested models using a conditional breakdown
of the chi-square measures for the models.
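As a minimal sketch of how the two fit statistics compare in practice, the following Python fragment computes both from paired observed and expected counts. The function name `fit_statistics` and the counts are hypothetical illustrations, not values from the NAEP analyses.

```python
import math

def fit_statistics(observed, expected):
    """Return (G2, X2) for paired lists of observed and expected counts."""
    g2 = 0.0
    x2 = 0.0
    for obs, exp in zip(observed, expected):
        if obs > 0:  # a zero observed count contributes nothing to G2
            g2 += 2.0 * obs * math.log(obs / exp)
        x2 += (obs - exp) ** 2 / exp
    return g2, x2

# Hypothetical counts: (correct, incorrect) for one item in two cells
observed = [30.0, 70.0, 55.0, 45.0]
expected = [33.0, 67.0, 52.0, 48.0]
g2, x2 = fit_statistics(observed, expected)
print(round(g2, 3), round(x2, 3))
```

With expected counts of this size the two values are close to one another, in line with the remark above that they are essentially the same for a given model and data set.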
INTRODUCTION

The 2-parameter logistic item response model expresses the
probability of a correct response to Item j from Agent i as

$$P_{ij} = \Psi\left(\frac{\theta_i - \beta_j}{\alpha_j}\right), \qquad (1)$$

where

     $\Psi(x)$ denotes the logistic function exp(x)/[1+exp(x)],
     $\theta_i$ is the ability parameter for Agent i,
     $\beta_j$ is the threshold parameter of Item j, and
     $\alpha_j$ is the dispersion parameter of Item j (the
          reciprocal of the slope parameter of Item j).

Reiser's (1980) latent trait model for group effects follows
this form, with "Agent i" interpreted as the group of subjects in
a specified cell of the NAEP demographic sampling design, and with
$\theta_i$ being a linear function of a vector of group-effect parameters.
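The model in (1), with the group ability written as a linear function of group effects, can be sketched as follows. The function names, the design row, and the effect values are hypothetical and serve only to illustrate the form of the model.

```python
import math

def logistic(x):
    """Logistic function exp(x) / [1 + exp(x)]."""
    return math.exp(x) / (1.0 + math.exp(x))

def group_ability(design_row, effects):
    """Ability of a design cell as a linear function of group effects."""
    return sum(x * g for x, g in zip(design_row, effects))

def p_correct(theta, threshold, dispersion):
    """Probability of a correct response under the 2-parameter model."""
    return logistic((theta - threshold) / dispersion)

# Hypothetical cell: intercept plus two demographic contrasts
theta = group_ability([1, 1, 0], [0.2, 0.3, -0.4])
print(p_correct(theta, threshold=0.5, dispersion=2.0))  # theta equals threshold
```

Because the dispersion is the reciprocal of the slope, a larger dispersion flattens the response curve around the threshold.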

Item and group parameters are determined uniquely only up to
a linear transformation. When subsets of items from the same
scale are calibrated in separate data sets (e.g., data from
different assessment years or different age groups), linear
transformations must be found which optimally rescale item and
group-parameter estimates from any given calibration to a common
scale with a specified origin and unit-size.

The method of linking calibrations described in this paper is
intended for the case in which two or more calibration runs have
been performed on independent sets of data. In each case, all
items are assumed to belong to the same scale. It is necessary
that each calibration contain at least two items that appear in
some other calibration, and that all calibrations are linked
either directly or indirectly. (Calibrations k and l are linked
directly if they have items in common; they are linked indirectly
if Calibration k shares items with Calibration h, which in turn
shares items with Calibration l. Any such chain of finite length
constitutes an indirect link.) The method utilizes information
from all links among all calibrations in the estimation of
optimal transformations to a common scale.
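The two conditions just stated can be checked mechanically before any linking is attempted. The sketch below is one way to do so; the function name `linkage_ok` and the item identifiers are hypothetical, and a direct link is taken to mean at least one shared item.

```python
from collections import deque

def linkage_ok(item_sets):
    """item_sets: list of sets of item identifiers, one set per calibration."""
    m = len(item_sets)
    # Each calibration needs at least two items that appear elsewhere.
    for k in range(m):
        others = set().union(*(item_sets[l] for l in range(m) if l != k))
        if len(item_sets[k] & others) < 2:
            return False
    # Breadth-first search over direct links, starting from Calibration 1.
    seen = {0}
    queue = deque([0])
    while queue:
        k = queue.popleft()
        for l in range(m):
            if l not in seen and item_sets[k] & item_sets[l]:
                seen.add(l)
                queue.append(l)
    return len(seen) == m

# Calibrations 1 and 3 share no items but are linked through Calibration 2.
print(linkage_ok([{1, 2, 3}, {2, 3, 4, 5}, {4, 5, 6}]))
```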

SETTING UP NOTATION 
We concern ourselves with item and group parameter estimates
from M separate calibrations. Item and group parameter estimates
are denoted as follows:

     Bjk is the estimate of the threshold parameter of Item
          j from Calibration k, if Item j has been included in
          Calibration k; otherwise, this value is undefined;
     Sjk is the estimate of the dispersion parameter of Item j
          from Calibration k, if Item j has been included in that
          calibration run;
     Θik is the estimate of the ability of Group i obtained
          in Calibration k, if appropriate.
The linear transformations we seek will, for convenience,
rescale the estimates from all other calibrations to the scale
determined in Calibration 1. They are denoted as follows:

     Lk(x) = Ak x + Ck.

They are applied to the estimates as follows:

     Θik* = Ak Θik + Ck,

     Bjk* = Ak Bjk + Ck, and

     Sjk* = Ak Sjk.

It is clear that each linking item will have at least two
estimates of each of its parameters, after these transformations
have been applied to the results from each calibration. Inasmuch
as the transformations represent optimal rescaling to a common
unit and origin, final estimates of item parameters may be
obtained by taking the average of the several estimates of a
particular parameter, with each estimate weighted by the squared
reciprocal of its rescaled standard error of estimation.
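A short sketch of how one calibration's estimates would be carried to the common scale follows. The function names and the numerical values of Ak and Ck are hypothetical; note that the intercept applies to thresholds and abilities but not to dispersions, and that standard errors are multiplied by Ak only.

```python
def rescale_location(value, a_k, c_k):
    """Rescale a threshold or group ability: Ak * x + Ck."""
    return a_k * value + c_k

def rescale_dispersion(value, a_k):
    """Rescale a dispersion: Ak * x (no intercept)."""
    return a_k * value

def rescale_se(se, a_k):
    """Standard errors are multiplied by Ak only."""
    return a_k * se

# Hypothetical transformation for Calibration k
a_k, c_k = 1.25, -0.5
print(rescale_location(2.0, a_k, c_k))   # threshold
print(rescale_dispersion(4.0, a_k))      # dispersion
print(rescale_se(0.4, a_k))              # standard error
```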

THE FITTING FUNCTION

The weighted least-squares fitting function that
simultaneously estimates the transformations for Calibrations 2
through M, using information from all available links, is shown
below. It is to be understood that $A_1$ is fixed at 1 and $C_1$ at 0.

$$F = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{\ell>k} \left[ (A_k B_{jk} + C_k - A_\ell B_{j\ell} - C_\ell)^2 \, W_{jk\ell} + (A_k S_{jk} - A_\ell S_{j\ell})^2 \, W^*_{jk\ell} \right]$$

where

$$W_{jk\ell} = \begin{cases} \left\{ \sqrt{A_k^2\, SE^2(B_{jk}) + A_\ell^2\, SE^2(B_{j\ell})} \right\}^{-1} & \text{if Item } j \text{ is included in both calibrations,} \\ 0 & \text{otherwise;} \end{cases}$$

$$W^*_{jk\ell} = \begin{cases} \left\{ \sqrt{A_k^2\, SE^2(S_{jk}) + A_\ell^2\, SE^2(S_{j\ell})} \right\}^{-1} & \text{if Item } j \text{ is included in both calibrations,} \\ 0 & \text{otherwise.} \end{cases}$$

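The fitting function can be evaluated directly for any trial set of transformations. The sketch below is an illustration under stated assumptions: the names `pair_weight` and `fitting_function` and the data layout are hypothetical, and the weights follow the WEIGHT subroutine of the program listing (the reciprocal of the sum of squared rescaled standard errors).

```python
from itertools import combinations

def pair_weight(a_k, se_k, a_l, se_l):
    """Weight for one item shared by Calibrations k and l (as in WEIGHT)."""
    return 1.0 / ((a_k * se_k) ** 2 + (a_l * se_l) ** 2)

def fitting_function(est, A, C):
    """est[(j, k)] = (B, SE_B, S, SE_S); A and C are dicts keyed by k."""
    total = 0.0
    items = sorted({j for j, _ in est})
    for j in items:
        for k, l in combinations(sorted(A), 2):
            if (j, k) not in est or (j, l) not in est:
                continue  # weight is zero unless the item is in both
            b_k, seb_k, s_k, ses_k = est[(j, k)]
            b_l, seb_l, s_l, ses_l = est[(j, l)]
            d_b = A[k] * b_k + C[k] - A[l] * b_l - C[l]
            d_s = A[k] * s_k - A[l] * s_l
            total += d_b ** 2 * pair_weight(A[k], seb_k, A[l], seb_l)
            total += d_s ** 2 * pair_weight(A[k], ses_k, A[l], ses_l)
    return total

# Hypothetical data: Calibration 2 is an exact rescaling of Calibration 1.
est = {(1, 1): (0.0, 0.2, 4.0, 0.5), (1, 2): (-0.5, 0.1, 2.0, 0.25),
       (2, 1): (1.0, 0.2, 5.0, 0.5), (2, 2): (0.0, 0.1, 2.5, 0.25),
       (3, 1): (3.0, 0.2, 6.0, 0.5), (3, 2): (1.0, 0.1, 3.0, 0.25)}
print(fitting_function(est, {1: 1.0, 2: 2.0}, {1: 0.0, 2: 1.0}))
print(fitting_function(est, {1: 1.0, 2: 1.0}, {1: 0.0, 2: 0.0}))
```

The first call uses the transformation that reproduces Calibration 1 exactly and so returns zero; the second, with the identity transformation, returns a positive value.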
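A computational method for obtaining the minimum of the
fitting function may begin with an initial
approximation which uses information from threshold estimates
only, as described in the following section. In this section, we
provide approximate first and second derivatives of the fitting
function with respect to the parameters of the transformations,
which may be used in a quasi-Newton solution. Given approximations
$A^{(m)}$ and $C^{(m)}$ of the parameters of the transformations, we obtain
better approximations as follows:

$$\begin{bmatrix} A \\ C \end{bmatrix}^{(m+1)} = \begin{bmatrix} A \\ C \end{bmatrix}^{(m)} - \begin{bmatrix} \dfrac{\partial^2 F}{\partial A\,\partial A} & \dfrac{\partial^2 F}{\partial A\,\partial C} \\ \dfrac{\partial^2 F}{\partial C\,\partial A} & \dfrac{\partial^2 F}{\partial C\,\partial C} \end{bmatrix}^{-1}_{(m)} \begin{bmatrix} \dfrac{\partial F}{\partial A} \\ \dfrac{\partial F}{\partial C} \end{bmatrix}_{(m)}$$
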
The presence of the slope parameters Ak in the weights
complicates the computation of derivatives. We propose,
therefore, that during the iterative solution of this problem, the
weights be considered as constants at each step. That is, during
the computation of the (m+1)'th estimates, the weights are to be
computed from the known values of the standard errors of the item
parameter estimates and the transformation slope parameter
estimates Ak obtained from the m'th step. This expedient can be
expected to have little effect on the efficiency of the solution.
Under this assumption, we obtain the first and second derivatives
of the fitting function F as shown below. It is to be understood
that these derivatives are for transformations 2 through M.
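First derivatives

$$\frac{\partial F}{\partial A_k} = 2 \sum_{j=1}^{N} \sum_{\ell=1}^{M} \left[ (A_k B_{jk}^2 - A_\ell B_{jk} B_{j\ell} + C_k B_{jk} - C_\ell B_{jk})\, W_{jk\ell} + (A_k S_{jk}^2 - A_\ell S_{jk} S_{j\ell})\, W^*_{jk\ell} \right]$$

$$\frac{\partial F}{\partial C_k} = 2 \sum_{j=1}^{N} \sum_{\ell=1}^{M} (A_k B_{jk} - A_\ell B_{j\ell} + C_k - C_\ell)\, W_{jk\ell}$$

Second derivatives ($\ell \neq k$ throughout)

$$\frac{\partial^2 F}{\partial A_k\,\partial A_k} = 2 \sum_{j=1}^{N} \left[ B_{jk}^2 \sum_{\ell=1}^{M} W_{jk\ell} + S_{jk}^2 \sum_{\ell=1}^{M} W^*_{jk\ell} \right]$$

$$\frac{\partial^2 F}{\partial A_k\,\partial A_\ell} = -2 \sum_{j=1}^{N} \left[ B_{jk} B_{j\ell}\, W_{jk\ell} + S_{jk} S_{j\ell}\, W^*_{jk\ell} \right]$$

$$\frac{\partial^2 F}{\partial A_k\,\partial C_k} = 2 \sum_{j=1}^{N} B_{jk} \sum_{\ell=1}^{M} W_{jk\ell}$$

$$\frac{\partial^2 F}{\partial A_k\,\partial C_\ell} = -2 \sum_{j=1}^{N} B_{jk}\, W_{jk\ell}$$

$$\frac{\partial^2 F}{\partial C_k\,\partial C_k} = 2 \sum_{j=1}^{N} \sum_{\ell=1}^{M} W_{jk\ell}$$

$$\frac{\partial^2 F}{\partial C_k\,\partial C_\ell} = -2 \sum_{j=1}^{N} W_{jk\ell}$$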
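With the weights held constant, the fitting function is quadratic in the transformation parameters, so a single Newton step from any starting point reaches the minimum for those weights. The sketch below illustrates this using thresholds only. It is not the program's own code: the function names, the zero-based calibration index (index 0 is the fixed reference calibration), and the data are all hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def newton_step(est, weights, M, A, C):
    """est[(j, k)] = threshold; weights[(j, k, l)] = fixed weight, k < l."""
    n_par = 2 * (M - 1)  # slopes first, then intercepts
    grad = [0.0] * n_par
    hess = [[0.0] * n_par for _ in range(n_par)]
    for (j, k, l), w in weights.items():
        r = A[k] * est[(j, k)] + C[k] - A[l] * est[(j, l)] - C[l]
        d = [0.0] * n_par  # derivative of the residual r
        if k > 0:
            d[k - 1] = est[(j, k)]
            d[M - 1 + k - 1] = 1.0
        if l > 0:
            d[l - 1] = -est[(j, l)]
            d[M - 1 + l - 1] = -1.0
        for a in range(n_par):
            grad[a] += 2.0 * w * r * d[a]
            for b in range(n_par):
                hess[a][b] += 2.0 * w * d[a] * d[b]
    step = solve(hess, grad)
    p = [p_i - s_i for p_i, s_i in zip(A[1:] + C[1:], step)]
    return [1.0] + p[:M - 1], [0.0] + p[M - 1:]

# Hypothetical thresholds: the second calibration satisfies B1 = 2*B2 + 1.
est = {(1, 0): 0.0, (2, 0): 1.0, (3, 0): 3.0,
       (1, 1): -0.5, (2, 1): 0.0, (3, 1): 1.0}
weights = {(1, 0, 1): 1.0, (2, 0, 1): 1.0, (3, 0, 1): 1.0}
A, C = newton_step(est, weights, 2, [1.0, 1.0], [0.0, 0.0])
print(A, C)
```

In the weighted solution the weights would be recomputed from the new slopes after each such step, which is why several iterations are still needed.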

AN UNWEIGHTED LEAST-SQUARES APPROXIMATION

An unweighted solution using information from item threshold
estimates only may be obtained by redefining the weight terms in
the fitting function F. First, all weights relating to item
dispersion terms, $W^*_{jk\ell}$, are set to zero. Second, the weights
relating to thresholds are replaced by simple indicator variables:

$$D_{jk\ell} = \begin{cases} 1 & \text{if Item } j \text{ is included in both calibrations,} \\ 0 & \text{otherwise.} \end{cases}$$

FINAL ESTIMATES OF ITEM PARAMETERS 
The transformations determining the minimum of the fitting
function will take group-effect estimates to the common scale
defined by the first calibration. Item parameters may also be
rescaled accordingly. Each linking item will have at least two
estimates of its threshold and dispersion, accompanied by rescaled
standard errors of estimation. (The standard error of a rescaled
item parameter is simply the standard error from the calibration
run, multiplied by the appropriate transformation parameter Ak.)
To obtain a single point estimate of a given parameter, one may
take the average of the several estimates, each weighted by the
squared reciprocal of its rescaled standard error.

With either the weighted or the unweighted solution, one may
obtain an approximate chi-square value to test the hypothesis that
all estimates of a given item parameter are equivalent within the
ranges of calibration error. For example, the chi-square for the
equality of the several estimates of the threshold of Item j is
given by

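$$\chi^2_j = \sum_{k=1}^{M} \delta_{jk}\, \frac{(B_{jk}^* - B_{j\cdot}^*)^2}{SE^2(B_{jk}^*)},$$

in which $B_{j\cdot}^*$ denotes the weighted average of the rescaled
threshold estimates of Item j, and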



where $\delta_{jk}$ = 1 if Item j was included in Calibration k and 0 if not.
The number of degrees of freedom for this quantity is the count of
appearances of the item in all calibrations, minus one.

A test of fit for the entire set of linking transformations
may be obtained by summing quantities as defined above, over all
items and both thresholds and dispersions, with degrees of freedom
similarly summed.
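The pooling and the per-item test can be sketched as follows. This is an illustration under stated assumptions, not the program's own computation: the function names and the values are hypothetical, and each squared deviation is divided by the squared rescaled standard error of the estimate concerned. The program listing standardizes its residuals somewhat differently.

```python
import math

def pooled_estimate(values, ses):
    """Average of estimates, each weighted by 1 / (rescaled SE squared)."""
    weights = [1.0 / se ** 2 for se in ses]
    mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    se_mean = math.sqrt(1.0 / sum(weights))
    return mean, se_mean

def equivalence_chi_square(values, ses):
    """Chi-square for equality of several estimates, and its df."""
    mean, _ = pooled_estimate(values, ses)
    chi2 = sum(((v - mean) / se) ** 2 for v, se in zip(values, ses))
    return chi2, len(values) - 1

# Hypothetical rescaled thresholds of one item from three calibrations
values = [1.10, 1.25, 0.95]
ses = [0.20, 0.30, 0.25]
print(pooled_estimate(values, ses))
print(equivalence_chi_square(values, ses))
```

Summing the chi-square values and the degrees of freedom over items, for thresholds and dispersions alike, gives the overall test of fit described above.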
A COMPUTER PROGRAM FOR LINKING CALIBRATIONS

A FORTRAN IV COMPUTER PROGRAM FOR LINKING ITEM CALIBRATIONS
(SOURCE CODE AND EXAMPLE FROM 'UNDERSTANDING MATHEMATICAL CONCEPTS')



//CONCEPT JOB (8UZ303,NAEP,M),MISLEVY,RE=280K,TE=Y
// EXEC FORTGCLG,USERLIB='SYS2.MATCAL'
//FORT.SYSIN DD *
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 INAME(20)
      COMMON/PARCOM/INAME,ACRIT,N,M,NM,METHOD,M1,NP,NTRIS,NTRIL,
     $              IDIAG,MAXITR
      NAMELIST/INPUT/ACRIT,N,M,METHOD,IDIAG,MAXITR,INAME
C
C     METHOD OF SOLUTION:
C         1. UNWEIGHTED LEAST SQUARES
C         2. WEIGHTED, THRESH INFO ONLY
C         3. WEIGHTED, THRESH & DISP INFO
C
C     N  = TOTAL # ITEMS
C     M  = # CALIBRATIONS
C     NP = # PARAMETERS TO BE ESTIMATED,
C          2*(M-1).
C
      METHOD=0
      MAXITR=10
      ACRIT=.001
      READ(5,INPUT)
      M1=M-1
      NP=2*M1
      NM=N*M
      NTRIS=(M1*(M1+1))/2
      NTRIL=(NP*(NP+1))/2
      CALL COMPUT
      STOP
      END
      SUBROUTINE COMPUT
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 INAME(20)
      COMMON/PARCOM/INAME,ACRIT,N,M,NM,METHOD,M1,NP,NTRIS,NTRIL,
     $              IDIAG,MAXITR
      DIMENSION INCID(20,4),B(20,4),BSE(20,4),S(20,4),SSE(20,4),
     $   RSCSLO(4),RSCINT(4),PARAMS(6),CHANGE(6),FDRV(6),
     $   SDRV(21),KDELTA(20,4,4),WB(20,4,4),WS(20,4,4),
     $   AVEB(20),AVES(20),...
     $   STNRES(80),SLOPE(20,4),SLOSE(20,4),AVESLO(20),AVSLSE(20)
      REAL*4 FMT(20)
C
      DO 10 K=1,M
      RSCSLO(K)=1D0
      RSCINT(K)=0D0
      DO 10 J=1,N
      INCID(J,K)=0
      B  (J,K)=0D0
      BSE(J,K)=0D0
      S  (J,K)=0D0
      SSE(J,K)=0D0
      SLOPE(J,K)=1E69
      SLOSE(J,K)=1E69
      DO 10 L=1,M
      KDELTA(J,K,L)=0
      WB(J,K,L)=0D0
      WS(J,K,L)=0D0
   10 CONTINUE
      READ(5,15) FMT
   15 FORMAT(20A4)
      DO 20 I=1,999999
      READ(5,FMT,END=21) J,K,THR,THRSE,DISP,DISPSE
      INCID(J,K)=1
      B  (J,K)=THR
      BSE(J,K)=THRSE
      S  (J,K)=DISP
      SSE(J,K)=DISPSE
   20 CONTINUE
   21 IF(IDIAG.LE.0) GOTO 30
      DO 30 K=1,M
      WRITE(6,9040) K
      DO 25 J=1,N
      IF(INCID(J,K).EQ.0) GOTO 22
      WRITE(6,9060) J,B(J,K),BSE(J,K),S(J,K),SSE(J,K)
   22 CONTINUE
   25 CONTINUE
   30 CONTINUE
C
C     SET UP KRONECKER DELTA MATRIX;
C     KDELTA(J,K,L) = 1 IF ITEM J APPEARS
C                       FOR BOTH CALIBRATIONS K & L,
C                   = 0 IF NOT.
C
   50 CONTINUE
      DO 60 J=1,N
      DO 60 K=1,M
      DO 60 L=1,M
      IF(INCID(J,K).EQ.1 .AND. INCID(J,L).EQ.1) KDELTA(J,K,L)=1
      WB(J,K,L)=KDELTA(J,K,L)
   60 CONTINUE
      IF(IDIAG.LT.2) GOTO 70
      DO 70 J=1,N
      WRITE(6,6000) J,(L,L=1,M)
      DO 65 K=1,M
      WRITE(6,6100)(K,(KDELTA(J,K,L),L=1,M))
   65 CONTINUE
 6000 FORMAT('-KDELTA MATRIX, ITEM',I4,5X,20I3)
 6100 FORMAT(22X,4X,21I3)
   70 CONTINUE
      ICYCL=-1
  100 ICYCL=ICYCL+1
      DO 700 ITR=1,MAXITR
      IF(IDIAG.GT.0) WRITE(6,9000) ICYCL,ITR
      DO 103 I=1,M1
      PARAMS(I)=RSCSLO(I+1)
      PARAMS(I+M1)=RSCINT(I+1)
  103 CONTINUE
C
C     IF WEIGHTED SOLUTION,
C     COMPUTE WEIGHTS.
C
      IF(METHOD.GE.1 .AND. ICYCL.GT.0)
     $   CALL WEIGHT(KDELTA,RSCSLO,BSE,WB)
      IF(METHOD.GE.2 .AND. ICYCL.GT.0)
     $   CALL WEIGHT(KDELTA,RSCSLO,SSE,WS)
C
C     CALCULATE DERIVATIVES
C
      CALL FIRST (B,BSE,S,SSE,RSCSLO,RSCINT,WB,WS,FDRV)
      CALL SECOND(B,BSE,S,SSE,RSCSLO,RSCINT,WB,WS,SDRV)
C
C     NEWTON-RAPHSON STEP
C
      CALL INVSD(SDRV,NP,DET,WORK1,WORK2)
      CALL MPYM(SDRV,FDRV,CHANGE,NP,NP,1,0,1)
      BIGC=0D0
      BIGD=0D0
      DO 150 I=1,NP
      CHANGE(I)=-CHANGE(I)
      IF(DABS(CHANGE(I)).GT.BIGC) BIGC=DABS(CHANGE(I))
      IF(DABS(FDRV(I)).GT.BIGD) BIGD=DABS(FDRV(I))
  150 CONTINUE
      CALL ADDM(PARAMS,CHANGE,PARAMS,NP,1,0)
      DO 160 I=1,M1
      RSCSLO(I+1)=PARAMS(I)
      RSCINT(I+1)=PARAMS(I+M1)
  160 CONTINUE
      CALL FUNCT(RSCSLO,RSCINT,B,S,WB,WS,CHISQ)
      WRITE(6,9020) ICYCL,ITR,BIGC,CHISQ
      IF(IDIAG.GT.0) CALL DPRNT(RSCSLO,1,M,0,8HSLOPES  )
      IF(IDIAG.GT.0) CALL DPRNT(RSCINT,1,M,0,8HINTERCPT)
      IF((BIGC.LE.ACRIT .OR. BIGD.LE.ACRIT) .OR.
     $   (METHOD.GT.2 .AND. ICYCL.EQ.0 .AND. ITR.GE.3)) GOTO 710
  700 CONTINUE
  710 CALL DPRNT(RSCSLO,1,M,0,8HSLOPES  )
      CALL DPRNT(RSCINT,1,M,0,8HINTERCPT)
      CALL DPRNT(SDRV,NP,NP,1,8HCOVARNCE)
C
C     IF WEIGHTED SOLUTION DESIRED,
C     AND JUST UNWEIGHTED SOLUTION
C     HAS BEEN COMPUTED, GO BACK
C     AND DO WEIGHTED SOLUTION.
C
      IF(ICYCL.EQ.0 .AND. METHOD.GE.1) GOTO 100
C
C     RESCALE ITEM PARAMETERS
C
      DO 765 J=1,N
      AVEB(J)=0D0
      AVES(J)=0D0
      ADJB(J)=0D0
      ADJS(J)=0D0
      AVEBSE(J)=0D0
      AVESSE(J)=0D0
  765 CONTINUE
      DO 780 K=1,M
      WRITE(6,9050) K
      DO 775 J=1,N
      IF(INCID(J,K).EQ.0) GOTO 770
      B  (J,K)= B  (J,K)*RSCSLO(K) + RSCINT(K)
      BSE(J,K)= BSE(J,K)*RSCSLO(K)
      S  (J,K)= S  (J,K)*RSCSLO(K)
      SSE(J,K)= SSE(J,K)*RSCSLO(K)
      SLOPE(J,K)=1D0/S(J,K)
      SLOSE(J,K)=SSE(J,K)*SLOPE(J,K)**2
      WGTB=0D0
      WGTS=0D0
      IF(BSE(J,K).GT.0D0) WGTB = 1D0/(BSE(J,K)**2)
      IF(SSE(J,K).GT.0D0) WGTS = 1D0/(SSE(J,K)**2)
      AVEB(J)=AVEB(J)+B(J,K)*WGTB
      AVES(J)=AVES(J)+S(J,K)*WGTS
      ADJB(J)=ADJB(J)+WGTB
      ADJS(J)=ADJS(J)+WGTS
      AVEBSE(J)=AVEBSE(J)+(WGTB*BSE(J,K))**2
      AVESSE(J)=AVESSE(J)+(WGTS*SSE(J,K))**2
      WRITE(6,9060) J,B(J,K),BSE(J,K),S(J,K),SSE(J,K),
     $              SLOPE(J,K),SLOSE(J,K)
  770 CONTINUE
  775 CONTINUE
  780 CONTINUE
      WRITE(6,9070)
      DO 830 J=1,N
      AVESLO(J)=0D0
      AVSLSE(J)=0D0
      IF(ADJB(J).GT.0D0) AVEB(J)=AVEB(J)/ADJB(J)
      IF(ADJS(J).GT.0D0) AVES(J)=AVES(J)/ADJS(J)
      IF(ADJB(J).GT.0D0) AVEBSE(J)=(1D0/ADJB(J))*DSQRT(AVEBSE(J))
      IF(ADJS(J).GT.0D0) AVESSE(J)=DSQRT(1D0/AVESSE(J))
      IF(AVES(J).NE.0D0) AVESLO(J)=1D0/AVES(J)
      IF(AVES(J).NE.0D0) AVSLSE(J)=AVESSE(J)*AVESLO(J)**2
      WRITE(6,9060) J,AVEB(J),AVEBSE(J),AVES(J),AVESSE(J),
     $              AVESLO(J),AVSLSE(J)
  830 CONTINUE
C
C     STANDARDIZED RESIDUALS
C
      IDF=0
      DO 840 K=2,M
      K1=K-1
      DO 837 L=1,K1
      DO 834 J=1,N
      IDF=IDF+KDELTA(J,K,L)
  834 CONTINUE
  837 CONTINUE
  840 CONTINUE
C
      CHISQ=0D0
      DO 850 J=1,N
      DO 845 K=1,M
      INDX=J+(K-1)*N
      STNRES(INDX)=0D0
      IF(INCID(J,K).LE.0) GOTO 845
      STNRES(INDX)=(B(J,K)-AVEB(J))/DSQRT(AVEBSE(J)**2+BSE(J,K)**2)
      CHISQ=CHISQ + STNRES(INDX)**2
  845 CONTINUE
  850 CONTINUE
      CALL DPRNT(STNRES,N,M,0,8HRESIDUAL)
      WRITE(6,8000)
      DO 900 J=1,N
      WRITE(6,8100) INAME(J),
     $   (B(J,K),BSE(J,K),SLOPE(J,K),SLOSE(J,K),K=1,4),
     $   AVEB(J),AVEBSE(J),AVESLO(J),AVSLSE(J)
  900 CONTINUE

 8000 FORMAT(1H1//' ITEM   ',5(' THRESH  SE  SLOPE  SE ')/
     $       1X,31(4H    ))
 8100 FORMAT(1X,A8,5(F7.2,F5.2,F5.2,F5.2))
 9000 FORMAT(1H1//' CYCLE ',I4,' ITERATION',I4)
 9020 FORMAT(1H ,2I4,2F12.6)
 9040 FORMAT(1H1//'-INPUT ITEM PARAMETERS FOR CALIBRATION',I4//
     $ ' ITEM THRESHOLD      S.E. DISPERSION      S.E.'/
     $ '    ')
 9050 FORMAT(1H1/'-RESCALED ITEM PARAMETERS FOR CALIBRATION',I4//
     $ ' ITEM THRESHOLD      S.E. DISPERSION      S.E.',
     $ '     SLOPE      S.E.'/
     $ '    ',
     $ '    ')
 9060 FORMAT(1X,I4,6F10.3)
 9070 FORMAT(1H1//'-GRAND AVERAGES OF ITEM PARAMETERS'/
     $ ' ITEM THRESHOLD      S.E. DISPERSION      S.E.',
     $ '     SLOPE      S.E.'/
     $ '    ',
     $ '    ')

      RETURN
      END
      SUBROUTINE WEIGHT(KDELTA,RSCSLO,SE,W)
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 INAME(20)
      COMMON/PARCOM/INAME,ACRIT,N,M,NM,METHOD,M1,NP,NTRIS,NTRIL,
     $              IDIAG,MAXITR
      DIMENSION KDELTA(20,4,4),RSCSLO(4),SE(20,4),W(20,4,4)
C
      DO 400 K=1,M
      K1=K-1
      DO 300 L=1,K1
      DO 200 J=1,N
      W(J,K,L)=0D0
      IF(KDELTA(J,K,L).LE.0) GOTO 100
      W(J,K,L)=1D0/((RSCSLO(K)*SE(J,K))**2+(RSCSLO(L)*SE(J,L))**2)
  100 W(J,L,K)=W(J,K,L)
  200 CONTINUE
  300 CONTINUE
  400 CONTINUE
      IF(IDIAG.LT.2) GOTO 570
      DO 570 J=1,N
      WRITE(6,6000) J,(L,L=1,M)
      DO 565 K=1,M
      WRITE(6,6100)(K,(W(J,K,L),L=1,M))
  565 CONTINUE
 6000 FORMAT('-WEIGHT MATRIX, ITEM',I4,10I10)
 6100 FORMAT(12X,4X,I10,10F10.4)
  570 CONTINUE
C
      RETURN
      END
      SUBROUTINE FUNCT(RSCSLO,RSCINT,B,S,WB,WS,CHISQ)
C
C     COMPUTE CHI-SQUARE
C
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 INAME(20)
      COMMON/PARCOM/INAME,ACRIT,N,M,NM,METHOD,M1,NP,NTRIS,NTRIL,
     $              IDIAG,MAXITR
      DIMENSION RSCSLO(4),RSCINT(4),B(20,4),S(20,4),
     $          WB(20,4,4),WS(20,4,4)
C
      CHISQ=0D0
      DO 500 K=1,M
      K1=K-1
      DO 400 L=1,K1
      IF(K1.LE.0) GOTO 400
      DO 300 J=1,N
      IF(WB(J,K,L).LE.0D0) GOTO 300
      CHISQ=CHISQ + (RSCSLO(K)*B(J,K) + RSCINT(K)
     $      -RSCSLO(L)*B(J,L) - RSCINT(L))**2 * WB(J,K,L)
      IF(METHOD.LT.2) GOTO 300
      CHISQ=CHISQ + (RSCSLO(K)*S(J,K)
     $      -RSCSLO(L)*S(J,L))**2 * WS(J,K,L)
  300 CONTINUE
  400 CONTINUE
  500 CONTINUE
      RETURN
      END
      SUBROUTINE FIRST(B,BSE,S,SSE,RSCSLO,RSCINT,WB,WS,FDRV)
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 INAME(20)
      COMMON/PARCOM/INAME,ACRIT,N,M,NM,METHOD,M1,NP,NTRIS,NTRIL,
     $              IDIAG,MAXITR
      DIMENSION B(20,4),S(20,4),BSE(20,4),SSE(20,4),RSCSLO(4),RSCINT(4),
     $          WB(20,4,4),WS(20,4,4),FDRV(6)
C
C     FIRST DERIVS OF SLOPES
C
      DO 190 K=2,M
      KK=K-1
      FDRV(KK)=0D0
      DO 170 J=1,N
      DO 150 L=1,M
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 150
      FDRV(KK)=FDRV(KK) + WB(J,K,L)*
     $   (RSCSLO(K)*B(J,K)**2 - RSCSLO(L)*B(J,K)*B(J,L)
     $    + RSCINT(K)*B(J,K) - RSCINT(L)*B(J,K))
      IF(METHOD.LT.2) GOTO 150
      FDRV(KK)=FDRV(KK) + WS(J,K,L)*
     $   (RSCSLO(K)*S(J,K)**2 - RSCSLO(L)*S(J,K)*S(J,L))
  150 CONTINUE
  170 CONTINUE
      FDRV(KK)=FDRV(KK) * 2D0
  190 CONTINUE
C
C     FIRST DERIVS OF INTRCPS
C
      DO 290 K=2,M
      KK=M1 + (K-1)
      FDRV(KK)=0D0
      DO 270 J=1,N
      DO 250 L=1,M
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 250
      FDRV(KK)=FDRV(KK) + WB(J,K,L)*
     $   (RSCSLO(K)*B(J,K) - RSCSLO(L)*B(J,L)
     $    + RSCINT(K) - RSCINT(L))
  250 CONTINUE
  270 CONTINUE
      FDRV(KK)=FDRV(KK) * 2D0
  290 CONTINUE
      IF(IDIAG.GT.0) CALL DPRNT(FDRV,1,NP,0,8HFDRV    )
      RETURN
      END
      SUBROUTINE SECOND(B,BSE,S,SSE,RSCSLO,RSCINT,WB,WS,SDRV)
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 INAME(20)
      COMMON/PARCOM/INAME,ACRIT,N,M,NM,METHOD,M1,NP,NTRIS,NTRIL,
     $              IDIAG,MAXITR
      DIMENSION B(20,4),S(20,4),BSE(20,4),SSE(20,4),RSCSLO(4),RSCINT(4),
     $          WB(20,4,4),WS(20,4,4),SDRV(21),SDRVA(6),
     $          SDRVB(9),SDRVC(6)
C
C     SLOPE DOUBLE DERIVS
C
      INDX=0
      DO 190 K=2,M
      INDX=INDX + (K-1)
      SDRVA(INDX)=0D0
      DO 170 J=1,N
      SUMWB=0D0
      SUMWS=0D0
      DO 150 L=1,M
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 150
      SUMWB=SUMWB + WB(J,K,L)
      SUMWS=SUMWS + WS(J,K,L)
  150 CONTINUE
      SDRVA(INDX)=SDRVA(INDX) + SUMWB*B(J,K)**2
      IF(METHOD.GE.2)
     $ SDRVA(INDX)=SDRVA(INDX) + SUMWS*S(J,K)**2
  170 CONTINUE
      SDRVA(INDX)=SDRVA(INDX) * 2D0
  190 CONTINUE
C
C     SLOPE CROSS DERIVS
C
      INDX=0
      DO 290 K=2,M
      DO 270 L=2,K
      INDX=INDX+1
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 270
      SDRVA(INDX)=0D0
      DO 250 J=1,N
      SDRVA(INDX)=SDRVA(INDX) + B(J,K)*B(J,L)*WB(J,K,L)
      IF(METHOD.LT.2) GOTO 250
      SDRVA(INDX)=SDRVA(INDX) + S(J,K)*S(J,L)*WS(J,K,L)
  250 CONTINUE
      SDRVA(INDX)= - SDRVA(INDX) * 2D0
  270 CONTINUE
  290 CONTINUE
C
C     SLOPE*INTRCP CROSS DERIVS,
C     SAME CALIBRATION
C



      DO 390 K=2,M
      KK=K-1
      INDX=(KK-1)*M1 + KK
      SDRVB(INDX)=0D0
      DO 370 J=1,N
      SUMWB=0D0
      DO 350 L=1,M
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 350
      SUMWB=SUMWB + WB(J,K,L)
  350 CONTINUE
      SDRVB(INDX)=SDRVB(INDX) + B(J,K)*SUMWB
  370 CONTINUE
      SDRVB(INDX)=SDRVB(INDX) * 2D0
  390 CONTINUE
C
C     SLOPE*INTRCP CROSS DERIVS,
C     DIFFERENT CALIBRATIONS
C
      INDX=0
      DO 490 K=2,M
      DO 470 L=2,M
      INDX=INDX+1
      IF(K.EQ.L) GOTO 470
      SDRVB(INDX)=0D0
      DO 450 J=1,N
      SDRVB(INDX)=SDRVB(INDX) + B(J,K)*WB(J,K,L)
  450 CONTINUE
      SDRVB(INDX)= - SDRVB(INDX) * 2D0
  470 CONTINUE
  490 CONTINUE
C
C     INTRCP DOUBLE DERIVS
C
      INDX=0
      DO 590 K=2,M
      INDX=INDX + (K-1)
      SDRVC(INDX)=0D0
      DO 570 J=1,N
      DO 550 L=1,M
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 550
      SDRVC(INDX)=SDRVC(INDX) + WB(J,K,L)
  550 CONTINUE
  570 CONTINUE
      SDRVC(INDX)=SDRVC(INDX) * 2D0
  590 CONTINUE
C
C     INTRCP CROSS DERIVS
C
      INDX=0
      DO 690 K=2,M
      DO 670 L=2,K
      INDX=INDX + 1
      IF(L.EQ.K .OR. WB(J,K,L).LE.0D0) GOTO 670
      SDRVC(INDX)=0D0
      DO 650 J=1,N
      SDRVC(INDX)=SDRVC(INDX) + WB(J,K,L)
  650 CONTINUE
      SDRVC(INDX)= - SDRVC(INDX) * 2D0
  670 CONTINUE
  690 CONTINUE
      CALL ADJRC(SDRVA,SDRVB,SDRVC,SDRV,M1,M1)
      IF(IDIAG.GT.0) CALL DPRNT(SDRV,NP,NP,1,8HSDRV    )
      RETURN
      END
//GO.SYSIN DD *
 &INPUT N=17,M=4,METHOD=3,MAXITR=20,INAME=
 '5-A45532',
 '5-B41532',
 '5-B41732',
 '5-B31732',
 '5-N00002',
 '5-B11008',
 '5-A71043',
 '5-A21022',
 '5-B32632',
 '5-K30004',
 '5-K10010',
 '5-B33232',
 '5-G43009',
 '5-H12025',
 '5-G20001',
 '5-K51020',
 '5-A21032',
 &END
(I2,I1,4F8.3)
 41   3.046    .360   5.263    .281
 51  -1.687    .360   5.000    .750
 61    .185    .220   6.250    .781
 81    .413    .160   4.000    .480
 91   2.681    .320   5.556    .926
101   1.107    .270  10.000   2.000
111   3.756    .440   5.263    .831
121    .227    .220   6.667    .889
131   1.923    .280   7
COMMENTS ON NAEP PUBLIC-USE DATA TAPES

Due to the quantity of information provided, the NAEP tapes
were in general cumbersome to work with. The documentation for
the 1977/1978 and change item tapes can only be described as
excellent, comprehensive in scope and accurate in detail. The
files comprising the tapes were well-organized, the information
about variable locations as well as that contained in the value
labels of the accompanying SPSS files was invaluable, and the
classification of items contained in the appendices greatly
facilitated the construction of our scales. In comparison, the
1972/1973 tape was more difficult to work with, the organization
and contents of the tape less readily understood.

A few minor difficulties that we encountered bear mentioning.
The difference between the "no response" and "missing values"
classifications of item responses is unclear from the
documentation. The fact that in-school and out-of-school
17-year-olds are assigned different values for the region variable
proved to be a source of temporary confusion.

Data for items which were supposed to be invariant across two
or more age/year combinations, according to the documentation
provided, occasionally did not seem right. As an example, Item
5-B32632 appeared in both the 13-year-old and 17-year-old
instruments in 1977/78, as T1020 and S0921 respectively.
Proportions of correct response, however, suggested the item to be
far more difficult for 17-year-olds than for 13-year-olds, a trend
strongly contradicting evidence from every other item linking
these two age/year combinations. It is likely that data for the
17-year-olds is in error here. Similar problems arose for items
5-A71043 and 5-N00002. Such questionable item/age/year data
combinations were omitted from our computations.